Commit Graph

102359 Commits

Author SHA1 Message Date
John Hurley
326367427c net: sched: call reoffload op on block callback reg
Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
block when a callback tries to register to a block that already has
offloaded rules. If all existing rules cannot be offloaded then the
registration is rejected. This replaces the previous policy of rejecting
such callback registration outright.

On unregistration of a callback, the rules are flushed for that given cb.
The implementation of block sharing in the NFP driver, for example,
duplicates shared rules to all devs bound to a block. This meant that
rules could still exist in hw even after a device is unbound from a block
(assuming the block still remains active).

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:33 +09:00
John Hurley
31533cba43 net: sched: cls_flower: implement offload tcf_proto_op
Add the reoffload tcf_proto_op in flower to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the reoffload function properly maintains this flag, keep a
reference counter for the number of instances of the filter that are in
hardware. Only update the flag when this counter changes from or to 0. Add
a generic helper function to implement this behaviour.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
John Hurley
e56185c78b net: sched: add tcf_proto_op to offload a rule
Create a new tcf_proto_op called 'reoffload' that generates a new offload
message for each node in a tcf_proto. Pointers to the tcf_proto and
whether the offload request is to add or delete the node are included.
Also included is a callback function to send the offload message to and
the option of priv data to go with the cb.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
John Hurley
60513bd82c net: sched: pass extack pointer to block binds and cb registration
Pass the extact struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 23:21:32 +09:00
Yangbo Lu
a8f62d0c6f ptp: support DPAA FMan 1588 timer in ptp_qoriq
This patch is to support DPAA (Data Path Acceleration Architecture)
1588 timer by adding "fsl,fman-ptp-timer" compatible, sharing
interrupt with FMan, adding FSL_DPAA_ETH dependency, and fixing
up register offset.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 22:15:14 +09:00
Yafang Shao
fb223502ec tcp: add SNMP counter for zero-window drops
It will be helpful if we could display the drops due to zero window or no
enough window space.
So a new SNMP MIB entry is added to track this behavior.
This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
/proc/net/netstat in TcpExt line as TCPZeroWindowDrop.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:49:08 +09:00
David Miller
07d78363dc net: Convert NAPI gro list into a small hash table.
Improve the performance of GRO receive by splitting flows into
multiple hash chains.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:33:04 +09:00
David Miller
d4546c2509 net: Convert GRO SKB handling to list_head.
Manage pending per-NAPI GRO packets via list_head.

Return an SKB pointer from the GRO receive handlers.  When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.

Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:33:04 +09:00
David S. Miller
9ff3b40e41 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-06-26 08:07:17 +09:00
Linus Torvalds
6f0d349d92 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix netpoll OOPS in r8169, from Ville Syrjälä.

 2) Fix bpf instruction alignment on powerpc et al., from Eric Dumazet.

 3) Don't ignore IFLA_MTU attribute when creating new ipvlan links. From
    Xin Long.

 4) Fix use after free in AF_PACKET, from Eric Dumazet.

 5) Mis-matched RTNL unlock in xen-netfront, from Ross Lagerwall.

 6) Fix VSOCK loopback on big-endian, from Claudio Imbrenda.

 7) Missing RX buffer offset correction when computing DMA addresses in
    mvneta driver, from Antoine Tenart.

 8) Fix crashes in DCCP's ccid3_hc_rx_send_feedback, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits)
  sfc: make function efx_rps_hash_bucket static
  strparser: Corrected typo in documentation.
  qmi_wwan: add support for the Dell Wireless 5821e module
  cxgb4: when disabling dcb set txq dcb priority to 0
  net_sched: remove a bogus warning in hfsc
  net: dccp: switch rx_tstamp_last_feedback to monotonic clock
  net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
  net: Remove depends on HAS_DMA in case of platform dependency
  MAINTAINERS: Add file patterns for dsa device tree bindings
  net: mscc: make sparse happy
  net: mvneta: fix the Rx desc DMA address in the Rx path
  Documentation: e1000: Fix docs build error
  Documentation: e100: Fix docs build error
  Documentation: e1000: Use correct heading adornment
  Documentation: e100: Use correct heading adornment
  ipv6: mcast: fix unsolicited report interval after receiving querys
  vhost_net: validate sock before trying to put its fd
  VSOCK: fix loopback on big-endian systems
  net: ethernet: ti: davinci_cpdma: make function cpdma_desc_pool_create static
  xen-netfront: Update features after registering netdev
  ...
2018-06-25 15:58:17 +08:00
Linus Torvalds
2ce413ec16 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq fixes from Thomas Gleixer:
 "A pile of rseq related fixups:

   - Prevent infinite recursion when delivering SIGSEGV

   - Remove the abort of rseq critical section on fork() as syscalls
     inside rseq critical sections are explicitely forbidden. So no
     point in doing the abort on the child.

   - Align the rseq structure on 32 bytes in the ARM selftest code.

   - Fix file permissions of the test script"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Avoid infinite recursion when delivering SIGSEGV
  rseq/cleanup: Do not abort rseq c.s. in child on fork()
  rseq/selftests/arm: Align 'struct rseq_cs' on 32 bytes
  rseq/selftests: Make run_param_test.sh executable
2018-06-24 20:18:19 +08:00
Linus Torvalds
d3a6749c33 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "Two tiny fixes:

   - Add the missing machine_real_restart() to objtools noreturn list so
     it stops complaining

   - Fix a trivial comment typo"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel.h: Fix a typo in comment
  objtool: Add machine_real_restart() to the noreturn list
2018-06-24 20:06:42 +08:00
Linus Torvalds
d4e860eaf0 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for x86:

   - Make Xen PV guest deal with speculative store bypass correctly

   - Address more fallout from the 5-Level pagetable handling. Undo an
     __initdata annotation to avoid section mismatch and malfunction
     when post init code would touch the freed variable.

   - Handle exception fixup in math_error() before calling notify_die().
     The reverse call order incorrectly triggers notify_die() listeners
     for soemthing which is handled correctly at the site which issues
     the floating point instruction.

   - Fix an off by one in the LLC topology calculation on AMD

   - Handle non standard memory block sizes gracefully un UV platforms

   - Plug a memory leak in the microcode loader

   - Sanitize the purgatory build magic

   - Add the x86 specific device tree bindings directory to the x86
     MAINTAINER file patterns"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix 'no5lvl' handling
  Revert "x86/mm: Mark __pgtable_l5_enabled __initdata"
  x86/CPU/AMD: Fix LLC ID bit-shift calculation
  MAINTAINERS: Add file patterns for x86 device tree bindings
  x86/microcode/intel: Fix memleak in save_microcode_patch()
  x86/platform/UV: Add kernel parameter to set memory block size
  x86/platform/UV: Use new set memory block size function
  x86/platform/UV: Add adjustable set memory block size function
  x86/build: Remove unnecessary preparation for purgatory
  Revert "kexec/purgatory: Add clean-up for purgatory directory"
  x86/xen: Add call of speculative_store_bypass_ht_init() to PV paths
  x86: Call fixup_exception() before notify_die() in math_error()
2018-06-24 19:59:52 +08:00
Linus Torvalds
2da2ca24a3 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A set of fixes and updates for the locking code:

   - Prevent lockdep from updating irq state within its own code and
     thereby confusing itself.

   - Buid fix for older GCCs which mistreat anonymous unions

   - Add a missing lockdep annotation in down_read_non_onwer() which
     causes up_read_non_owner() to emit a lockdep splat

   - Remove the custom alpha dec_and_lock() implementation which is
     incorrect in terms of ordering and use the generic one.

  The remaining two commits are not strictly fixes. They provide irqsave
  variants of atomic_dec_and_lock() and refcount_dec_and_lock(). These
  are required to merge the relevant updates and cleanups into different
  maintainer trees for 4.19, so routing them into mainline without
  actual users is the sanest approach.

  They should have been in -rc1, but last weekend I took the liberty to
  just avoid computers in order to regain some mental sanity"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/qspinlock: Fix build for anonymous union in older GCC compilers
  locking/lockdep: Do not record IRQ state within lockdep code
  locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMS
  locking/refcounts: Implement refcount_dec_and_lock_irqsave()
  atomic: Add irqsave variant of atomic_dec_and_lock()
  alpha: Remove custom dec_and_lock() implementation
2018-06-24 19:36:16 +08:00
Linus Torvalds
78fea6334f Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of fixes mostly for the ARM/GIC world:

   - Fix the MSI affinity handling in the ls-scfg irq chip driver so it
     updates and uses the effective affinity mask correctly

   - Prevent binding LPIs to offline CPUs and respect the Cavium erratum
     which requires that LPIs which belong to an offline NUMA node are
     not bound to a CPU on a different NUMA node.

   - Free only the amount of allocated interrupts in the GIC-V2M driver
     instead of trying to free log2(nrirqs).

   - Prevent emitting SYNC and VSYNC targetting non existing interrupt
     collections in the GIC-V3 ITS driver

   - Ensure that the GIV-V3 interrupt redistributor is correctly
     reprogrammed on CPU hotplug

   - Remove a stale unused helper function"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqdesc: Delete irq_desc_get_msi_desc()
  irqchip/gic-v3-its: Fix reprogramming of redistributors on CPU hotplug
  irqchip/gic-v3-its: Only emit VSYNC if targetting a valid collection
  irqchip/gic-v3-its: Only emit SYNC if targetting a valid collection
  irqchip/gic-v3-its: Don't bind LPI to unavailable NUMA node
  irqchip/gic-v2m: Fix SPI release on error path
  irqchip/ls-scfg-msi: Fix MSI affinity handling
  genirq/debugfs: Add missing IRQCHIP_SUPPORTS_LEVEL_MSI debug
2018-06-24 19:01:18 +08:00
Linus Torvalds
77072ca59f Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - Further timeout fixes. We aren't quite there yet, so expect another
   round of fixes for that to completely close some of the IRQ vs
   completion races. (Christoph/Bart)

 - Set of NVMe fixes from the usual suspects, mostly error handling

 - Two off-by-one fixes (Dan)

 - Another bdi race fix (Jan)

 - Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)

* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
  blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
  bdi: Fix another oops in wb_workfn()
  lightnvm: Remove depends on HAS_DMA in case of platform dependency
  nvme-pci: limit max IO size and segments to avoid high order allocations
  nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
  nvme-fc: release io queues to allow fast fail
  nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
  block: sed-opal: Fix a couple off by one bugs
  blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
  nvmet: reset keep alive timer in controller enable
  nvme-rdma: don't override opts->queue_size
  nvme-rdma: Fix command completion race at error recovery
  nvme-rdma: fix possible free of a non-allocated async event buffer
  nvme-rdma: fix possible double free condition when failing to create a controller
  Revert "block: Add warning for bi_next not NULL in bio_endio()"
  block: fix timeout changes for legacy request drivers
2018-06-24 06:33:54 +08:00
Linus Torvalds
4ab59fcfd5 Merge tag 'for-linus-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
 "This contains the following fixes/cleanups:

   - the removal of a BUG_ON() which wasn't necessary and which could
     trigger now due to a recent change

   - a correction of a long standing bug happening very rarely in Xen
     dom0 when a hypercall buffer from user land was not accessible by
     the hypervisor for very short periods of time due to e.g. page
     migration or compaction

   - usage of EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() in a
     Xen-related driver (no breakage possible as using those symbols
     without others already exported via EXPORT-SYMBOL_GPL() wouldn't
     make any sense)

   - a simplification for Xen PVH or Xen ARM guests

   - some additional error handling for callers of xenbus_printf()"

* tag 'for-linus-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: Remove unnecessary BUG_ON from __unbind_from_irq()
  xen: add new hypercall buffer mapping device
  xen/scsiback: add error handling for xenbus_printf
  scsi: xen-scsifront: add error handling for xenbus_printf
  xen/grant-table: Export gnttab_{alloc|free}_pages as GPL
  xen: add error handling for xenbus_printf
  xen: share start flags between PV and PVH
2018-06-23 20:44:11 +08:00
Eric Dumazet
5424ea2739 netns: get more entropy from net_hash_mix()
struct net are effectively allocated from order-1 pages on x86,
with one object per slab, meaning that the 13 low order bits
of their addresses are zero.

Once shifted by L1_CACHE_SHIFT, this leaves 7 zero-bits,
meaning that net_hash_mix() does not help spreading
objects on various hash tables.

For example, TCP listen table has 32 buckets, meaning that
all netns use the same bucket for port 80 or port 443.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-23 10:59:56 +09:00
Thomas Gleixner
7731b8bc94 Merge branch 'linus' into x86/urgent
Required to queue a dependent fix.
2018-06-22 21:20:35 +02:00
Jan Kara
3ee7e8697d bdi: Fix another oops in wb_workfn()
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().

Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-22 12:08:07 -06:00
Will Deacon
784e0300fe rseq: Avoid infinite recursion when delivering SIGSEGV
When delivering a signal to a task that is using rseq, we call into
__rseq_handle_notify_resume() so that the registers pushed in the
sigframe are updated to reflect the state of the restartable sequence
(for example, ensuring that the signal returns to the abort handler if
necessary).

However, if the rseq management fails due to an unrecoverable fault when
accessing userspace or certain combinations of RSEQ_CS_* flags, then we
will attempt to deliver a SIGSEGV. This has the potential for infinite
recursion if the rseq code continuously fails on signal delivery.

Avoid this problem by using force_sigsegv() instead of force_sig(), which
is explicitly designed to reset the SEGV handler to SIG_DFL in the case
of a recursive fault. In doing so, remove rseq_signal_deliver() from the
internal rseq API and have an optional struct ksignal * parameter to
rseq_handle_notify_resume() instead.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: peterz@infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: boqun.feng@gmail.com
Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
2018-06-22 19:04:22 +02:00
John Garry
bed9df97b3 irqdesc: Delete irq_desc_get_msi_desc()
Function irq_desc_get_msi_desc() is not referenced in the kernel (and does
not seem to have been referenced since e39758e0ea, 3 years ago), so
delete it.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <marc.zyngier@arm.com>
Cc: <will.deacon@arm.com>
Cc: <kstewart@linuxfoundation.org>
Cc: <julien.thierry@arm.com>
Cc: <andrew@lunn.ch>
Cc: <trivial@kernel.org>
Link: https://lkml.kernel.org/r/1529667333-92959-1-git-send-email-john.garry@huawei.com
2018-06-22 14:22:02 +02:00
Marc Zyngier
72a8edc2d9 genirq/debugfs: Add missing IRQCHIP_SUPPORTS_LEVEL_MSI debug
Debug is missing the IRQCHIP_SUPPORTS_LEVEL_MSI debug entry, making debugfs
slightly less useful.

Take this opportunity to also add a missing comment in the definition of
IRQCHIP_SUPPORTS_LEVEL_MSI.

Fixes: 6988e0e0d2 ("genirq/msi: Limit level-triggered MSI to platform devices")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Cc: Sumit Garg <sumit.garg@linaro.org>
Link: https://lkml.kernel.org/r/20180622095254.5906-2-marc.zyngier@arm.com
2018-06-22 14:22:00 +02:00
Eric Dumazet
cadefe5f58 tcp_bbr: fix bbr pacing rate for internal pacing
This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.

This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.

Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:59:22 +09:00
Wei Wang
3f6c65d625 tcp: ignore rcv_rtt sample with old ts ecr value
When receiving multiple packets with the same ts ecr value, only try
to compute rcv_rtt sample with the earliest received packet.
This is because the rcv_rtt calculated by later received packets
could possibly include long idle time or other types of delay.
For example:
(1) server sends last packet of reply with TS val V1
(2) client ACKs last packet of reply with TS ecr V1
(3) long idle time passes
(4) client sends next request data packet with TS ecr V1 (again!)
At this time, the rcv_rtt computed on server with TS ecr V1 will be
inflated with the idle time and should get ignored.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:45:01 +09:00
NeilBrown
c0690016a7 rhashtable: clean up dereference of ->future_tbl.
Using rht_dereference_bucket() to dereference
->future_tbl looks like a type error, and could be confusing.
Using rht_dereference_rcu() to test a pointer for NULL
adds an unnecessary barrier - rcu_access_pointer() is preferred
for NULL tests when no lock is held.

This uses 3 different ways to access ->future_tbl.
- if we know the mutex is held, use rht_dereference()
- if we don't hold the mutex, and are only testing for NULL,
  use rcu_access_pointer()
- otherwise (using RCU protection for true dereference),
  use rht_dereference_rcu().

Note that this includes a simplification of the call to
rhashtable_last_table() - we don't do an extra dereference
before the call any more.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:43:28 +09:00
NeilBrown
9b4f64a227 rhashtable: simplify INIT_RHT_NULLS_HEAD()
The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are
no longer used - so drop them.  This allows us to also
remove the nhash argument from nested_table_alloc().

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:43:27 +09:00
NeilBrown
9f9a707738 rhashtable: remove nulls_base and related code.
This "feature" is unused, undocumented, and untested and so doesn't
really belong.  A patch is under development to properly implement
support for detecting when a search gets diverted down a different
chain, which the common purpose of nulls markers.

This patch actually fixes a bug too.  The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful an ineffective.

This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_null() to test for it.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:43:27 +09:00
NeilBrown
0eb71a9da5 rhashtable: split rhashtable.h
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code.  rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22 13:43:27 +09:00
Steven Rostedt (VMware)
6cc65be4f6 locking/qspinlock: Fix build for anonymous union in older GCC compilers
One of my tests compiles the kernel with gcc 4.5.3, and I hit the
following build error:

  include/linux/semaphore.h: In function 'sema_init':
  include/linux/semaphore.h:35:17: error: unknown field 'val' specified in initializer
  include/linux/semaphore.h:35:17: warning: missing braces around initializer
  include/linux/semaphore.h:35:17: warning: (near initialization for '(anonymous).raw_lock.<anonymous>.val')

I bisected it down to:

 625e88be1f ("locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'")

... which makes qspinlock have an anonymous union, which makes initializing it special
for older compilers. By adding strategic brackets, it makes the build
happy again.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Fixes: 625e88be1f ("locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'")
Link: http://lkml.kernel.org/r/20180621203526.172ab5c4@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-22 04:19:16 +02:00
Linus Torvalds
27db64f65f Merge tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include:

   - fix an rcu deadlock in nfs_delegation_find_inode()

   - fix NFSv4 deadlocks due to not freeing the session slot in
     layoutget

   - don't send layoutreturn if the layout is already invalid

   - prevent duplicate XID allocation

   - flexfiles: Don't tie up all the rpciod threads in resends"

* tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pNFS/flexfiles: Process writeback resends from nfsiod context as well
  pNFS/flexfiles: Don't tie up all the rpciod threads in resends
  sunrpc: Prevent duplicate XID allocation
  pNFS: Don't send layoutreturn if the layout is already invalid
  pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
  NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
2018-06-22 06:21:34 +09:00
Linus Torvalds
f43fc5a0c1 Merge tag 'acpi-4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "These fix a suspend/resume regression in the ACPI driver for Intel
  SoCs (LPSS), add a new system wakeup quirk to the ACPI EC driver and
  fix an inline stub of a function in the ACPI processor driver that
  diverged from the original.

  Specifics:

   - Fix a suspend/resume regression in the ACPI driver for Intel SoCs
     (LPSS) to make it work on systems where some power management
     quirks should only be applied for runtime PM and suspend-to-idle
     and not for suspend-to-RAM (Rafael Wysocki).

   - Add a system wakeup quirk for Thinkpad X1 Carbon 6th to the ACPI EC
     driver to avoid drainig battery too fast while suspended to idle on
     those systems (Mika Westerberg).

   - Fix an inline stub of acpi_processor_ppc_has_changed() to match the
     original function definition (Brian Norris)"

* tag 'acpi-4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / processor: Finish making acpi_processor_ppc_has_changed() void
  ACPI / EC: Use ec_no_wakeup on Thinkpad X1 Carbon 6th
  ACPI / LPSS: Avoid PM quirks on suspend and resume from S3
2018-06-22 06:00:13 +09:00
Wei Wang
8730662d7b kernel.h: Fix a typo in comment
Signed-off-by: Wei Wang <wvw@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Crt Mori <cmo@melexis.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: wei.vince.wang@gmail.com
Link: https://lkml.kernel.org/lkml/20180424212241.16013-1-wvw@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 17:39:18 +02:00
mike.travis@hpe.com
f642fb5864 x86/platform/UV: Add adjustable set memory block size function
Add a new function to "adjust" the current fixed UV memory block size
of 2GB so it can be changed to a different physical boundary.  This is
out of necessity so arch dependent code can accommodate specific BIOS
requirements which can align these new PMEM modules at less than the
default boundaries.

A "set order" type of function was used to insure that the memory block
size will be a power of two value without requiring a validity check.
64GB was chosen as the upper limit for memory block size values to
accommodate upcoming 4PB systems which have 6 more bits of physical
address space (46 becoming 52).

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Reviewed-by: Andrew Banman <andrew.banman@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dan.j.williams@intel.com
Cc: jgross@suse.com
Cc: kirill.shutemov@linux.intel.com
Cc: mhocko@suse.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/lkml/20180524201711.609546602@stormcage.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 16:14:45 +02:00
Mathieu Desnoyers
9a789fcfe8 rseq/cleanup: Do not abort rseq c.s. in child on fork()
Considering that we explicitly forbid system calls in rseq critical
sections, it is not valid to issue a fork or clone system call within a
rseq critical section, so rseq_fork() is not required to restart an
active rseq c.s. in the child process.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Link: https://lore.kernel.org/lkml/20180619133230.4087-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 12:40:13 +02:00
Eric Dumazet
9262478220 bpf: enforce correct alignment for instructions
After commit 9facc33687 ("bpf: reject any prog that failed read-only lock")
offsetof(struct bpf_binary_header, image) became 3 instead of 4,
breaking powerpc BPF badly, since instructions need to be word aligned.

Fixes: 9facc33687 ("bpf: reject any prog that failed read-only lock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-21 12:46:06 +09:00
Doron Roberts-Kedes
08ba91ee6e nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
If NBD_DISCONNECT_ON_CLOSE is set on a device, then the driver will
issue a disconnect from nbd_release if the device has no remaining
bdev->bd_openers.

Fix ret val so reconfigure with only setting the flag succeeds.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-20 19:10:06 -06:00
Linus Torvalds
1abd8a8f39 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
 "Here are eight fairly small fixes collected over the last two weeks.

  Regression and crashing bug fixes:

   - mlx4/5: Fixes for issues found from various checkers

   - A resource tracking and uverbs regression in the core code

   - qedr: NULL pointer regression found during testing

   - rxe: Various small bugs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/rxe: Fix missing completion for mem_reg work requests
  RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
  IB/uverbs: Fix ordering of ucontext check in ib_uverbs_write
  IB/mlx4: Fix an error handling path in 'mlx4_ib_rereg_user_mr()'
  RDMA/qedr: Fix NULL pointer dereference when running over iWARP without RDMA-CM
  IB/mlx5: Fix return value check in flow_counters_set_data()
  IB/mlx5: Fix memory leak in mlx5_ib_create_flow
  IB/rxe: avoid double kfree skb
2018-06-21 07:22:30 +09:00
Linus Torvalds
d8894a08d9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix crash on bpf_prog_load() errors, from Daniel Borkmann.

 2) Fix ATM VCC memory accounting, from David Woodhouse.

 3) fib6_info objects need RCU freeing, from Eric Dumazet.

 4) Fix SO_BINDTODEVICE handling for TCP sockets, from David Ahern.

 5) Fix clobbered error code in enic_open() failure path, from
    Govindarajulu Varadarajan.

 6) Propagate dev_get_valid_name() error returns properly, from Li
    RongQing.

 7) Fix suspend/resume in davinci_emac driver, from Bartosz Golaszewski.

 8) Various act_ife fixes (recursive locking, IDR leaks, etc.) from
    Davide Caratti.

 9) Fix buggy checksum handling in sungem driver, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
  ip: limit use of gso_size to udp
  stmmac: fix DMA channel hang in half-duplex mode
  net: stmmac: socfpga: add additional ocp reset line for Stratix10
  net: sungem: fix rx checksum support
  bpfilter: ignore binary files
  bpfilter: fix build error
  net/usb/drivers: Remove useless hrtimer_active check
  net/sched: act_ife: preserve the action control in case of error
  net/sched: act_ife: fix recursive lock and idr leak
  net: ethernet: fix suspend/resume in davinci_emac
  net: propagate dev_get_valid_name return code
  enic: do not overwrite error code
  net/tcp: Fix socket lookups with SO_BINDTODEVICE
  ptp: replace getnstimeofday64() with ktime_get_real_ts64()
  net/ipv6: respect rcu grace period before freeing fib6_info
  net: net_failover: fix typo in net_failover_slave_register()
  ipvlan: use ETH_MAX_MTU as max mtu
  net: hamradio: use eth_broadcast_addr
  enic: initialize enic->rfs_h.lock in enic_probe
  MAINTAINERS: Add Sam as the maintainer for NCSI
  ...
2018-06-21 07:13:42 +09:00
Brian Norris
a507a3065c ACPI / processor: Finish making acpi_processor_ppc_has_changed() void
Commit bca5f557dc "ACPI / processor: Make acpi_processor_ppc_has_changed()
void" changed one of the declarations of acpi_processor_ppc_has_changed()
to return void, but the !CPU_FREQ version still returns int. Let's return
void to be consistent.

Fixes: bca5f557dc "ACPI / processor: Make acpi_processor_ppc_has_changed() void"
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-06-20 10:50:40 +02:00
Linus Torvalds
6d90eb7ba3 Merge tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping rename from Christoph Hellwig:
 "Move all the dma-mapping code to kernel/dma and lose their dma-*
  prefixes"

* tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: move all DMA mapping code to kernel/dma
  dma-mapping: use obj-y instead of lib-y for generic dma ops
2018-06-20 16:30:01 +09:00
Eric Dumazet
9b0a8da8c4 net/ipv6: respect rcu grace period before freeing fib6_info
syzbot reported use after free that is caused by fib6_info being
freed without a proper RCU grace period.

CPU: 0 PID: 1407 Comm: udevd Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __read_once_size include/linux/compiler.h:188 [inline]
 find_rr_leaf net/ipv6/route.c:705 [inline]
 rt6_select net/ipv6/route.c:761 [inline]
 fib6_table_lookup+0x12b7/0x14d0 net/ipv6/route.c:1823
 ip6_pol_route+0x1c2/0x1020 net/ipv6/route.c:1856
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2082
 fib6_rule_lookup+0x211/0x6d0 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2110
 ip6_route_output include/net/ip6_route.h:82 [inline]
 icmpv6_xrlim_allow net/ipv6/icmp.c:211 [inline]
 icmp6_send+0x147c/0x2da0 net/ipv6/icmp.c:535
 icmpv6_send+0x17a/0x300 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0xa5/0x790 net/ipv6/route.c:2244
 dst_link_failure include/net/dst.h:427 [inline]
 ndisc_error_report+0xd1/0x1c0 net/ipv6/ndisc.c:695
 neigh_invalidate+0x246/0x550 net/core/neighbour.c:892
 neigh_timer_handler+0xaf9/0xde0 net/core/neighbour.c:978
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1d1/0x200 kernel/softirq.c:404
 exiting_irq arch/x86/include/asm/apic.h:527 [inline]
 smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
 </IRQ>
RIP: 0010:strlen+0x5e/0xa0 lib/string.c:482
Code: 24 00 74 3b 48 bb 00 00 00 00 00 fc ff df 4c 89 e0 48 83 c0 01 48 89 c2 48 89 c1 48 c1 ea 03 83 e1 07 0f b6 14 1a 38 ca 7f 04 <84> d2 75 23 80 38 00 75 de 48 83 c4 08 4c 29 e0 5b 41 5c 5d c3 48
RSP: 0018:ffff8801af117850 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: ffff880197f53bd0 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c5b06c RDI: ffff880197f53bc0
RBP: ffff8801af117868 R08: ffff88019a976540 R09: 0000000000000000
R10: ffff88019a976540 R11: 0000000000000000 R12: ffff880197f53bc0
R13: ffff880197f53bc0 R14: ffffffff899e4e90 R15: ffff8801d91c6a00
 strlen include/linux/string.h:267 [inline]
 getname_kernel+0x24/0x370 fs/namei.c:218
 open_exec+0x17/0x70 fs/exec.c:882
 load_elf_binary+0x968/0x5610 fs/binfmt_elf.c:780
 search_binary_handler+0x17d/0x570 fs/exec.c:1653
 exec_binprm fs/exec.c:1695 [inline]
 __do_execve_file.isra.35+0x16fe/0x2710 fs/exec.c:1819
 do_execveat_common fs/exec.c:1866 [inline]
 do_execve fs/exec.c:1883 [inline]
 __do_sys_execve fs/exec.c:1964 [inline]
 __se_sys_execve fs/exec.c:1959 [inline]
 __x64_sys_execve+0x8f/0xc0 fs/exec.c:1959
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f1576a46207
Code: 77 19 f4 48 89 d7 44 89 c0 0f 05 48 3d 00 f0 ff ff 76 e0 f7 d8 64 41 89 01 eb d8 f7 d8 64 41 89 01 eb df b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 00 8c 2d 00 f7 d8 64 89 02
RSP: 002b:00007ffff2784568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f1576a46207
RDX: 0000000001215b10 RSI: 00007ffff2784660 RDI: 00007ffff2785670
RBP: 0000000000625500 R08: 000000000000589c R09: 000000000000589c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000001215b10
R13: 0000000000000007 R14: 0000000001204250 R15: 0000000000000005

Allocated by task 12188:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
 kmalloc include/linux/slab.h:513 [inline]
 kzalloc include/linux/slab.h:706 [inline]
 fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
 ip6_route_info_create+0x782/0x2b50 net/ipv6/route.c:3013
 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3154
 ipv6_route_ioctl+0x5a5/0x760 net/ipv6/route.c:3660
 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546
 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973
 sock_ioctl+0x30d/0x680 net/socket.c:1097
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 1402:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x260 mm/slab.c:3813
 fib6_info_destroy+0x29b/0x350 net/ipv6/ip6_fib.c:207
 fib6_info_release include/net/ip6_fib.h:286 [inline]
 __ip6_del_rt_siblings net/ipv6/route.c:3235 [inline]
 ip6_route_del+0x11c4/0x13b0 net/ipv6/route.c:3316
 ipv6_route_ioctl+0x616/0x760 net/ipv6/route.c:3663
 inet6_ioctl+0x100/0x1f0 net/ipv6/af_inet6.c:546
 sock_do_ioctl+0xe4/0x3e0 net/socket.c:973
 sock_ioctl+0x30d/0x680 net/socket.c:1097
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x16f0 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801b5df2580
 which belongs to the cache kmalloc-256 of size 256
The buggy address is located 8 bytes inside of
 256-byte region [ffff8801b5df2580, ffff8801b5df2680)
The buggy address belongs to the page:
page:ffffea0006d77c80 count:1 mapcount:0 mapping:ffff8801da8007c0 index:0xffff8801b5df2e40
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0006c5cc48 ffffea0007363308 ffff8801da8007c0
raw: ffff8801b5df2e40 ffff8801b5df2080 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801b5df2480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801b5df2500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> ffff8801b5df2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                      ^
 ffff8801b5df2600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801b5df2680: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb

Fixes: a64efe142f ("net/ipv6: introduce fib6_info struct and helpers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reported-by: syzbot+9e6d75e3edef427ee888@syzkaller.appspotmail.com
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-20 07:57:23 +09:00
Trond Myklebust
42f86b44a4 pNFS/flexfiles: Don't tie up all the rpciod threads in resends
We do not want to have rpciod threads perform recursive calls into the
RPC layer since that can deadlock. In particular, having to wait for
a layoutget can be nasty... We want rather to defer scheduling those
retries until we're in the rpc_release() callback, since that is
called from the nfsiod workqueue.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 09:25:27 -04:00
Roger Pau Monne
1fe83888a2 xen: share start flags between PV and PVH
Use a global variable to store the start flags for both PV and PVH.
This allows the xen_initial_domain macro to work properly on PVH.

Note that ARM is also switched to use the new variable.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2018-06-19 13:51:00 +02:00
Bharat Potnuri
7350cdd025 RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
Few kernel applications like SCST-iSER create CQ using ib_create_cq(),
where accessing CQ structures using rdma restrack tool leads to below NULL
pointer dereference. This patch saves caller kernel module name similar to
ib_alloc_cq().

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8132ca70>] skip_spaces+0x30/0x30
PGD 738bac067 PUD 8533f0067 PMD 0
Oops: 0000 [#1] SMP
R10: ffff88017fc03300 R11: 0000000000000246 R12: 0000000000000000
R13: ffff88082fa5a668 R14: ffff88017475a000 R15: 0000000000000000
FS:  00002b32726582c0(0000) GS:ffff88087fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000008491a1000 CR4: 00000000003607e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffffc05af69c>] ? fill_res_name_pid+0x7c/0x90 [ib_core]
 [<ffffffffc05af79f>] fill_res_cq_entry+0xef/0x170 [ib_core]
 [<ffffffffc05af4c4>] res_get_common_dumpit+0x3c4/0x480 [ib_core]
 [<ffffffffc05af5d3>] nldev_res_get_cq_dumpit+0x13/0x20 [ib_core]
 [<ffffffff815bc1e7>] netlink_dump+0x117/0x2e0
 [<ffffffff815bcb8b>] __netlink_dump_start+0x1ab/0x230
 [<ffffffffc059fead>] ibnl_rcv_msg+0x11d/0x1f0 [ib_core]
 [<ffffffffc05af5c0>] ? nldev_res_get_mr_dumpit+0x20/0x20 [ib_core]
 [<ffffffffc059fd90>] ? rdma_nl_multicast+0x30/0x30 [ib_core]
 [<ffffffff815bea49>] netlink_rcv_skb+0xa9/0xc0
 [<ffffffffc05a0018>] ibnl_rcv+0x98/0xb0 [ib_core]
 [<ffffffff815be132>] netlink_unicast+0xf2/0x1b0
 [<ffffffff815be50f>] netlink_sendmsg+0x31f/0x6a0
 [<ffffffff8156b580>] sock_sendmsg+0xb0/0xf0
 [<ffffffff816ace9e>] ? _raw_spin_unlock_bh+0x1e/0x20
 [<ffffffff8156f998>] ? release_sock+0x118/0x170
 [<ffffffff8156b731>] SYSC_sendto+0x121/0x1c0
 [<ffffffff81568340>] ? sock_alloc_file+0xa0/0x140
 [<ffffffff81221265>] ? __fd_install+0x25/0x60
 [<ffffffff8156c2ce>] SyS_sendto+0xe/0x10
 [<ffffffff816b6c2a>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffff8132ca70>] skip_spaces+0x30/0x30
RSP <ffff88072be97760>
CR2: 0000000000000000

Cc: <stable@vger.kernel.org>
Fixes: f66c8ba4c9 ("RDMA/core: Save kernel caller name when creating PD and CQ objects")
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 11:32:58 -06:00
Linus Torvalds
a569631306 Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
Pull dmi update from Jean Delvare:
 "Expose SKU ID string as a DMI attribute"

* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  firmware: dmi: Add access to the SKU ID string
2018-06-18 13:43:09 +09:00
Simon Glass
b23908d3c4 firmware: dmi: Add access to the SKU ID string
This is used in some systems from user space for determining the identity
of the device.

Expose this as a file so that that user-space tools don't need to read
from /sys/firmware/dmi/tables/DMI

Signed-off-by: Simon Glass <sjg@chromium.org>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
2018-06-17 14:09:42 +02:00
David Woodhouse
9bbe60a67b atm: Preserve value of skb->truesize when accounting to vcc
ATM accounts for in-flight TX packets in sk_wmem_alloc of the VCC on
which they are to be sent. But it doesn't take ownership of those
packets from the sock (if any) which originally owned them. They should
remain owned by their actual sender until they've left the box.

There's a hack in pskb_expand_head() to avoid adjusting skb->truesize
for certain skbs, precisely to avoid messing up sk_wmem_alloc
accounting. Ideally that hack would cover the ATM use case too, but it
doesn't — skbs which aren't owned by any sock, for example PPP control
frames, still get their truesize adjusted when the low-level ATM driver
adds headroom.

This has always been an issue, it seems. The truesize of a packet
increases, and sk_wmem_alloc on the VCC goes negative. But this wasn't
for normal traffic, only for control frames. So I think we just got away
with it, and we probably needed to send 2GiB of LCP echo frames before
the misaccounting would ever have caused a problem and caused
atm_may_send() to start refusing packets.

Commit 14afee4b60 ("net: convert sock.sk_wmem_alloc from atomic_t to
refcount_t") did exactly what it was intended to do, and turned this
mostly-theoretical problem into a real one, causing PPPoATM to fail
immediately as sk_wmem_alloc underflows and atm_may_send() *immediately*
starts refusing to allow new packets.

The least intrusive solution to this problem is to stash the value of
skb->truesize that was accounted to the VCC, in a new member of the
ATM_SKB(skb) structure. Then in atm_pop_raw() subtract precisely that
value instead of the then-current value of skb->truesize.

Fixes: 158f323b98 ("net: adjust skb->truesize in pskb_expand_head()")
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Tested-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-17 08:27:01 +09:00
David S. Miller
0841d98641 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-06-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a panic in devmap handling in generic XDP where return type
   of __devmap_lookup_elem() got changed recently but generic XDP
   code missed the related update, from Toshiaki.

2) Fix a freeze when BPF progs are loaded that include BPF to BPF
   calls when JIT is enabled where we would later bail out via error
   path w/o dropping kallsyms, and another one to silence syzkaller
   splats from locking prog read-only, from Daniel.

3) Fix a bug in test_offloads.py BPF selftest which must not assume
   that the underlying system have no BPF progs loaded prior to test,
   and one in bpftool to fix accuracy of program load time, from Jakub.

4) Fix a bug in bpftool's probe for availability of the bpf(2)
   BPF_TASK_FD_QUERY subcommand, from Yonghong.

5) Fix a regression in AF_XDP's XDP_SKB receive path where queue
   id check got erroneously removed, from Björn.

6) Fix missing state cleanup in BPF's xfrm tunnel test, from William.

7) Check tunnel type more accurately in BPF's tunnel collect metadata
   kselftest, from Jian.

8) Fix missing Kconfig fragments for BPF kselftests, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-17 07:54:24 +09:00
Linus Torvalds
265c5596da Merge tag 'for-linus-20180616' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes that should go into -rc1. This contains:

   - bsg_open vs bsg_unregister race fix (Anatoliy)

   - NVMe pull request from Christoph, with fixes for regressions in
     this window, FC connect/reconnect path code unification, and a
     trace point addition.

   - timeout fix (Christoph)

   - remove a few unused functions (Christoph)

   - blk-mq tag_set reinit fix (Roman)"

* tag 'for-linus-20180616' of git://git.kernel.dk/linux-block:
  bsg: fix race of bsg_open and bsg_unregister
  block: remov blk_queue_invalidate_tags
  nvme-fabrics: fix and refine state checks in __nvmf_check_ready
  nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
  nvme-fabrics: refactor queue ready check
  blk-mq: remove blk_mq_tagset_iter
  nvme: remove nvme_reinit_tagset
  nvme-fc: fix nulling of queue data on reconnect
  nvme-fc: remove reinit_request routine
  blk-mq: don't time out requests again that are in the timeout handler
  nvme-fc: change controllers first connect to use reconnect path
  nvme: don't rely on the changed namespace list log
  nvmet: free smart-log buffer after use
  nvme-rdma: fix error flow during mapping request data
  nvme: add bio remapping tracepoint
  nvme: fix NULL pointer dereference in nvme_init_subsystem
  blk-mq: reinit q->tag_set_list entry only after grace period
2018-06-17 05:37:55 +09:00