CPUs can go offline shortly after kfree_call_rcu() has been invoked,
which can leave memory stranded until those CPUs come back online.
This commit therefore drains the kcrp of each CPU, not just the
ones that happen to be online.
Acked-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_segcblist_accelerate() function returns true iff it is necessary
to request another grace period. A tracing session showed that this
function unnecessarily requests grace periods.
For example, consider the following sequence of events:
1. Callbacks are queued only on the NEXT segment of CPU A's callback list.
2. CPU A runs RCU_SOFTIRQ, accelerating these callbacks from NEXT to WAIT.
3. Thus rcu_segcblist_accelerate() returns true, requesting grace period N.
4. RCU's grace-period kthread wakes up on CPU B and starts grace period N.
4. CPU A notices the new grace period and invokes RCU_SOFTIRQ.
5. CPU A's RCU_SOFTIRQ again invokes rcu_segcblist_accelerate(), but
there are no new callbacks. However, rcu_segcblist_accelerate()
nevertheless (uselessly) requests a new grace period N+1.
This extra grace period results in additional lock contention and also
additional wakeups, all for no good reason.
This commit therefore adds a check to rcu_segcblist_accelerate() that
prevents the return of true when there are no new callbacks.
This change reduces the number of grace periods (GPs) and wakeups in each
of eleven five-second rcutorture runs as follows:
+----+-------------------+-------------------+
| # | Number of GPs | Number of Wakeups |
+====+=========+=========+=========+=========+
| 1 | With | Without | With | Without |
+----+---------+---------+---------+---------+
| 2 | 75 | 89 | 113 | 119 |
+----+---------+---------+---------+---------+
| 3 | 62 | 91 | 105 | 123 |
+----+---------+---------+---------+---------+
| 4 | 60 | 79 | 98 | 110 |
+----+---------+---------+---------+---------+
| 5 | 63 | 79 | 99 | 112 |
+----+---------+---------+---------+---------+
| 6 | 57 | 89 | 96 | 123 |
+----+---------+---------+---------+---------+
| 7 | 64 | 85 | 97 | 118 |
+----+---------+---------+---------+---------+
| 8 | 58 | 83 | 98 | 113 |
+----+---------+---------+---------+---------+
| 9 | 57 | 77 | 89 | 104 |
+----+---------+---------+---------+---------+
| 10 | 66 | 82 | 98 | 119 |
+----+---------+---------+---------+---------+
| 11 | 52 | 82 | 83 | 117 |
+----+---------+---------+---------+---------+
The reduction in the number of wakeups ranges from 5% to 40%.
Cc: urezki@gmail.com
[ paulmck: Rework commit log and comment. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The RCU grace-period kthread's force-quiescent state (FQS) loop should
never see an offline CPU that has not yet reported a quiescent state.
After all, the offline CPU should have reported a quiescent state
during the CPU-offline process, or, failing that, by rcu_gp_init()
if it ran concurrently with either the CPU going offline or the last
task on a leaf rcu_node structure exiting its RCU read-side critical
section while all CPUs corresponding to that structure are offline.
The FQS loop should therefore complain if it does see an offline CPU
that has not yet reported a quiescent state.
And it does, but only once the grace period has been in force for a
full second. This commit therefore makes this warning more aggressive,
so that it will trigger as soon as the condition makes its appearance.
Light testing with TREE03 and hotplug shows no warnings. This commit
also converts the warning to WARN_ON_ONCE() in order to stave off possible
log spam.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Since at least v4.19, the FQS loop no longer reports quiescent states
for offline CPUs except in emergency situations.
This commit therefore fixes the comment in rcu_gp_init() to match the
current code.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit increases RCU's ability to defend itself by emitting a warning
if one of the nocb CB kthreads invokes the GP kthread's wait function.
This warning augments a similar check that is carried out at the end
of rcutorture testing and when RCU CPU stall warnings are emitted.
The problem with those checks is that the miscreants have long since
departed and disposed of any and all evidence.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the rcu_cpu_started per-CPU variable was added by commit
f64c6013a2 ("rcu/x86: Provide early rcu_cpu_starting() callback"),
there were multiple sets of per-CPU rcu_data structures. Therefore, the
rcu_cpu_started flag was added as a separate per-CPU variable. But now
there is only one set of per-CPU rcu_data structures, so this commit
moves rcu_cpu_started to a new ->cpu_started field in that structure.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Given that sysfs can change the value of rcu_cpu_stall_ftrace_dump at any
time, this commit adds a READ_ONCE() to the accesses to that variable.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Given that sysfs can change the value of rcu_kick_kthreads at any time,
this commit adds a READ_ONCE() to the sole access to that variable.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Given that sysfs can change the value of rcu_resched_ns at any time,
this commit adds a READ_ONCE() to the sole access to that variable.
While in the area, this commit also adds bounds checking, clamping the
value to at least a millisecond, but no longer than a second.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Given that sysfs can change the value of rcu_divisor at any time, this
commit adds a READ_ONCE to the sole access to that variable. While in
the area, this commit also adds bounds checking, clamping the value to
a shift that makes sense for a signed long.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_data structure's ->nocb_timer field is used to defer wakeups of
the corresponding no-CBs CPU's grace-period kthread ("rcuog*"), and that
structure's ->nocb_defer_wakeup field is used to track such deferral.
This means that the show_rcu_nocb_state() printing an error when those
fields are set for a CPU not corresponding to a no-CBs grace-period
kthread is erroneous.
This commit therefore switches the check from ->nocb_timer to
->nocb_bypass_timer and removes the check of ->nocb_defer_wakeup.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Originally, the call to rcu_preempt_blocked_readers_cgp() from
force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in
commit a77da14ce9 ("rcu: Yet another fix for preemption and CPU
hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of
rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so
invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition
of rcu_initiate_boost() simply releases the rcu_node structure's ->lock,
which is what happens when the "if" condition evaluates to false.
This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check,
so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y
kernels when there are readers blocking the current grace period.
This does not change the behavior, but reduces code-reader confusion by
eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost().
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On callback overload, it is necessary to quickly detect idle CPUs,
and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately,
the code following the call to this function does not repeat this check,
which means that in reality no actual quiescent-state forcing, instead
only a couple of quick and pointless wakeups at the beginning of the
grace period.
This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in
the post-wakeup "if" statement in rcu_gp_fqs_loop().
Fixes: 1fca4d12f4 ("rcu: Expedite first two FQS scans under callback-overload conditions")
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
A message of the form "rcu: !!! lDTs ." can be tracked down, but
doing so is not trivial. This commit therefore eases this process by
adding text so that this error message now reads as follows:
"rcu: nocb GP activity on CB-only CPU!!! lDTs ."
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
During acceleration of CB, the rsp's gp_seq is rcu_seq_snap'd. This is
the value used for acceleration - it is the value of gp_seq at which it
is safe the execute all callbacks in the callback list.
The rdp's gp_seq is not very useful for this scenario. Make
rcu_grace_period report the gp_seq_req instead as it allows one to
reason about how the acceleration works.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit moves the initialization of the CONFIG_PREEMPT=n version of
the rcu_exp_handler() function's rdp and rnp local variables into their
respective declarations to save a couple lines of code.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Pull io_uring fixes from Jens Axboe:
"A few differerent things in here.
Seems like syzbot got some more io_uring bits wired up, and we got a
handful of reports and the associated fixes are in here.
General fixes too, and a lot of them marked for stable.
Lastly, a bit of fallout from the async buffered reads, where we now
more easily trigger short reads. Some applications don't really like
that, so the io_read() code now handles short reads internally, and
got a cleanup along the way so that it's now easier to read (and
documented). We're now passing tests that failed before"
* tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
io_uring: short circuit -EAGAIN for blocking read attempt
io_uring: sanitize double poll handling
io_uring: internally retry short reads
io_uring: retain iov_iter state over io_read/io_write calls
task_work: only grab task signal lock when needed
io_uring: enable lookup of links holding inflight files
io_uring: fail poll arm on queue proc failure
io_uring: hold 'ctx' reference around task_work queue + execute
fs: RWF_NOWAIT should imply IOCB_NOIO
io_uring: defer file table grabbing request cleanup for locked requests
io_uring: add missing REQ_F_COMP_LOCKED for nested requests
io_uring: fix recursive completion locking on oveflow flush
io_uring: use TWA_SIGNAL for task_work uncondtionally
io_uring: account locked memory before potential error case
io_uring: set ctx sq/cq entry count earlier
io_uring: Fix NULL pointer dereference in loop_rw_iter()
io_uring: add comments on how the async buffered read retry works
io_uring: io_async_buf_func() need not test page bit
Pull arch/sh updates from Rich Felker:
"Cleanup, SECCOMP_FILTER support, message printing fixes, and other
changes to arch/sh"
* tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
sh: landisk: Add missing initialization of sh_io_port_base
sh: bring syscall_set_return_value in line with other architectures
sh: Add SECCOMP_FILTER
sh: Rearrange blocks in entry-common.S
sh: switch to copy_thread_tls()
sh: use the generic dma coherent remap allocator
sh: don't allow non-coherent DMA for NOMMU
dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
sh: unexport register_trapped_io and match_trapped_io_handler
sh: don't include <asm/io_trapped.h> in <asm/io.h>
sh: move the ioremap implementation out of line
sh: move ioremap_fixed details out of <asm/io.h>
sh: remove __KERNEL__ ifdefs from non-UAPI headers
sh: sort the selects for SUPERH alphabetically
sh: remove -Werror from Makefiles
sh: Replace HTTP links with HTTPS ones
arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
sh: stacktrace: Remove stacktrace_ops.stack()
sh: machvec: Modernize printing of kernel messages
sh: pci: Modernize printing of kernel messages
...
Pull x86 fixes from Ingo Molnar:
"Misc fixes and small updates all around the place:
- Fix mitigation state sysfs output
- Fix an FPU xstate/sxave code assumption bug triggered by
Architectural LBR support
- Fix Lightning Mountain SoC TSC frequency enumeration bug
- Fix kexec debug output
- Fix kexec memory range assumption bug
- Fix a boundary condition in the crash kernel code
- Optimize porgatory.ro generation a bit
- Enable ACRN guests to use X2APIC mode
- Reduce a __text_poke() IRQs-off critical section for the benefit of
PREEMPT_RT"
* tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Acquire pte lock with interrupts enabled
x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
x86/purgatory: Don't generate debug info for purgatory.ro
x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
kexec_file: Correctly output debugging information for the PT_LOAD ELF header
kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
x86/crash: Correct the address boundary of function parameters
x86/acrn: Remove redundant chars from ACRN signature
x86/acrn: Allow ACRN guest to use X2APIC mode
Pull scheduler fixes from Ingo Molnar:
"Two fixes: fix a new tracepoint's output value, and fix the formatting
of show-state syslog printouts"
* tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Fix the alignment of the show-state debug output
sched: Fix use of count for nr_running tracepoint
Pull perf fixes from Ingo Molnar:
"Misc fixes, an expansion of perf syscall access to CAP_PERFMON
privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"
* tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/rapl: Add support for Intel SPR platform
perf/x86/rapl: Support multiple RAPL unit quirks
perf/x86/rapl: Fix missing psys sysfs attributes
hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
kprobes: Remove show_registers() function prototype
perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
Pull locking fixlets from Ingo Molnar:
"A documentation fix and a 'fallthrough' macro update"
* tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Convert to use the preferred 'fallthrough' macro
Documentation/locking/locktypes: Fix a typo
Merge more updates from Andrew Morton:
"Subsystems affected by this patch series: mm/hotfixes, lz4, exec,
mailmap, mm/thp, autofs, sysctl, mm/kmemleak, mm/misc and lib"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
rtl818x: constify ioreadX() iomem argument (as in generic implementation)
iomap: constify ioreadX() iomem argument (as in generic implementation)
sh: use generic strncpy()
sh: clkfwk: remove r8/r16/r32
include/asm-generic/vmlinux.lds.h: align ro_after_init
mm: annotate a data race in page_zonenum()
mm/swap.c: annotate data races for lru_rotate_pvecs
mm/rmap: annotate a data race at tlb_flush_batched
mm/mempool: fix a data race in mempool_free()
mm/list_lru: fix a data race in list_lru_count_one
mm/memcontrol: fix a data race in scan count
mm/page_counter: fix various data races at memsw
mm/swapfile: fix and annotate various data races
mm/filemap.c: fix a data race in filemap_fault()
mm/swap_state: mark various intentional data races
mm/page_io: mark various intentional data races
mm/frontswap: mark various intentional data races
mm/kmemleak: silence KCSAN splats in checksum
...
This remoes the code from the COW path to call debug_dma_assert_idle(),
which was added many years ago.
Google shows that it hasn't caught anything in the 6+ years we've had it
apart from a false positive, and Hugh just noticed how it had a very
unfortunate spinlock serialization in the COW path.
He fixed that issue the previous commit (a85ffd59bd: "dma-debug: fix
debug_dma_assert_idle(), use rcu_read_lock()"), but let's see if anybody
even notices when we remove this function entirely.
NOTE! We keep the dma tracking infrastructure that was added by the
commit that introduced it. Partly to make it easier to resurrect this
debug code if we ever deside to, and partly because that tracking by pfn
and offset looks quite reasonable.
The problem with this debug code was simply that it was expensive and
didn't seem worth it, not that it was wrong per se.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common()
logic") improved unlock_page(), it has become more noticeable how
cow_user_page() in a kernel with CONFIG_DMA_API_DEBUG=y can create and
suffer from heavy contention on DMA debug's radix_lock in
debug_dma_assert_idle().
It is only doing a lookup: use rcu_read_lock() and rcu_read_unlock()
instead; though that does require the static ents[] to be moved
onstack...
...but, hold on, isn't that radix_tree_gang_lookup() and loop doing
quite the wrong thing: searching CACHELINES_PER_PAGE entries for an
exact match with the first cacheline of the page in question?
radix_tree_gang_lookup() is the right tool for the job, but we need
nothing more than to check the first entry it can find, reporting if
that falls anywhere within the page.
(Is RCU safe here? As safe as using the spinlock was. The entries are
never freed, so don't need to be freed by RCU. They may be reused, and
there is a faint chance of a race, with an offending entry reused while
printing its error info; but the spinlock did not prevent that either,
and I agree that it's not worth worrying about. ]
[ Side noe: this patch is a clear improvement to the status quo, but the
next patch will be removing this debug function entirely.
But just in case we decide we want to resurrect the debugging code
some day, I'm first applying this improvement patch so that it doesn't
get lost - Linus ]
Fixes: 3b7a6418c7 ("dma debug: account for cachelines and read-only mappings in overlap tracking")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timekeeping updates from Thomas Gleixner:
"A set of timekeeping/VDSO updates:
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the
counter read function when time namespace support is enabled.
Adding the pointer is a NOOP for all other architectures because
the compiler is supposed to optimize that out when it is unused in
the architecture specific inline. The change also solved a similar
problem for MIPS which fortunately has time namespaces not yet
enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another
sequence counter in the S390 implementation. A better solution is
to utilize the already existing VDSO sequence count for this. The
core code now exposes helper functions which allow to serialize
against the timekeeper code and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It
now has an embedded architecture specific struct embedded which
defaults to an empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support
to work from a common upstream base.
- A trivial comment fix"
* tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Delete repeated words in comments
lib/vdso: Allow to add architecture-specific vdso data
timekeeping/vsyscall: Provide vdso_update_begin/end()
vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
Pull more timer updates from Thomas Gleixner:
"A set of posix CPU timer changes which allows to defer the heavy work
of posix CPU timers into task work context. The tick interrupt is
reduced to a quick check which queues the work which is doing the
heavy lifting before returning to user space or going back to guest
mode. Moving this out is deferring the signal delivery slightly but
posix CPU timers are inaccurate by nature as they depend on the tick
so there is no real damage. The relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context
tick handler, but it also has the general benefit that the actual
heavy work is accounted to the task/process and not to the tick
interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of
posix CPU timers (which are unpriviledged) is armed for a
task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which
was just established for x86 with the new common entry/exit code which
got merged post 5.8 and is not the case for other KVM architectures"
* tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Select POSIX_CPU_TIMERS_TASK_WORK
posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
posix-cpu-timers: Split run_posix_cpu_timers()
Pull irq fixes from Thomas Gleixner:
"Two fixes in the core interrupt code which ensure that all error exits
unlock the descriptor lock"
* tag 'irq-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Unlock irq descriptor after errors
genirq/PM: Always unlock IRQ descriptor in rearm_wake_irq()
Pull module updates from Jessica Yu:
"The most important change would be Christoph Hellwig's patch
implementing proprietary taint inheritance, in an effort to discourage
the creation of GPL "shim" modules that interface between GPL symbols
and proprietary symbols.
Summary:
- Have modules that use symbols from proprietary modules inherit the
TAINT_PROPRIETARY_MODULE taint, in an effort to prevent GPL shim
modules that are used to circumvent _GPL exports. These are modules
that claim to be GPL licensed while also using symbols from
proprietary modules. Such modules will be rejected while non-GPL
modules will inherit the proprietary taint.
- Module export space cleanup. Unexport symbols that are unused
outside of module.c or otherwise used in only built-in code"
* tag 'modules-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
modules: inherit TAINT_PROPRIETARY_MODULE
modules: return licensing information from find_symbol
modules: rename the licence field in struct symsearch to license
modules: unexport __module_address
modules: unexport __module_text_address
modules: mark each_symbol_section static
modules: mark find_symbol static
modules: mark ref_module static
modules: linux/moduleparam.h: drop duplicated word in a comment
Current sysrq(t) output task fields name are not aligned with
actual task fields value, e.g.:
kernel: sysrq: Show State
kernel: task PC stack pid father
kernel: systemd S12456 1 0 0x00000000
kernel: Call Trace:
kernel: ? __schedule+0x240/0x740
To make it more readable, print fields name together with task fields
value in the same line, with fixed width:
kernel: sysrq: Show State
kernel: task:systemd state:S stack:12920 pid: 1 ppid: 0 flags:0x00000000
kernel: Call Trace:
kernel: __schedule+0x282/0x620
Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814030236.37835-1-libing.zhou@nokia-sbell.com
Pull networking fixes from David Miller:
"Some merge window fallout, some longer term fixes:
1) Handle headroom properly in lapbether and x25_asy drivers, from
Xie He.
2) Fetch MAC address from correct r8152 device node, from Thierry
Reding.
3) In the sw kTLS path we should allow MSG_CMSG_COMPAT in sendmsg,
from Rouven Czerwinski.
4) Correct fdputs in socket layer, from Miaohe Lin.
5) Revert troublesome sockptr_t optimization, from Christoph Hellwig.
6) Fix TCP TFO key reading on big endian, from Jason Baron.
7) Missing CAP_NET_RAW check in nfc, from Qingyu Li.
8) Fix inet fastreuse optimization with tproxy sockets, from Tim
Froidcoeur.
9) Fix 64-bit divide in new SFC driver, from Edward Cree.
10) Add a tracepoint for prandom_u32 so that we can more easily
perform usage analysis. From Eric Dumazet.
11) Fix rwlock imbalance in AF_PACKET, from John Ogness"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
net: openvswitch: introduce common code for flushing flows
af_packet: TPACKET_V3: fix fill status rwlock imbalance
random32: add a tracepoint for prandom_u32()
Revert "ipv4: tunnel: fix compilation on ARCH=um"
net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
net: ethernet: stmmac: Disable hardware multicast filter
net: stmmac: dwmac1000: provide multicast filter fallback
ipv4: tunnel: fix compilation on ARCH=um
vsock: fix potential null pointer dereference in vsock_poll()
sfc: fix ef100 design-param checking
net: initialize fastreuse on inet_inherit_port
net: refactor bind_bucket fastreuse into helper
net: phy: marvell10g: fix null pointer dereference
net: Fix potential memory leak in proto_register()
net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init
ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()
net/nfc/rawsock.c: add CAP_NET_RAW check.
hinic: fix strncpy output truncated compile warnings
drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check
net/tls: Fix kmap usage
...
If JOBCTL_TASK_WORK is already set on the targeted task, then we need
not go through {lock,unlock}_task_sighand() to set it again and queue
a signal wakeup. This is safe as we're checking it _after_ adding the
new task_work with cmpxchg().
The ordering is as follows:
task_work_add() get_signal()
--------------------------------------------------------------
STORE(task->task_works, new_work); STORE(task->jobctl);
mb(); mb();
LOAD(task->jobctl); LOAD(task->task_works);
This speeds up TWA_SIGNAL handling quite a bit, which is important now
that io_uring is relying on it for all task_work deliveries.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
Make kernel GNU build-id available in VMCOREINFO. Having build-id in
VMCOREINFO facilitates presenting appropriate kernel namelist image with
debug information file to kernel crash dump analysis tools. Currently
VMCOREINFO lacks uniquely identifiable key for crash analysis automation.
Regarding if this patch is necessary or matching of linux_banner and
OSRELEASE in VMCOREINFO employed by crash(8) meets the need -- IMO,
build-id approach more foolproof, in most instances it is a cryptographic
hash generated using internal code/ELF bits unlike kernel version string
upon which linux_banner is based that is external to the code. I feel
each is intended for a different purpose. Also OSRELEASE is not suitable
when two different kernel builds from same version with different features
enabled.
Currently for most linux (and non-linux) systems build-id can be extracted
using standard methods for file types such as user mode crash dumps,
shared libraries, loadable kernel modules etc., This is an exception for
linux kernel dump. Having build-id in VMCOREINFO brings some uniformity
for automation tools.
Tyler said:
: I think this is a nice improvement over today's linux_banner approach for
: correlating vmlinux to a kernel dump.
:
: The elf notes parsing in this patch lines up with what is described in in
: the "Notes (Nhdr)" section of the elf(5) man page.
:
: BUILD_ID_MAX is sufficient to hold a sha1 build-id, which is the default
: build-id type today in GNU ld(2). It is also sufficient to hold the
: "fast" build-id, which is the default build-id type today in LLVM lld(2).
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/1591849672-34104-1-git-send-email-vijayb@linux.microsoft.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>