Commit Graph

29899 Commits

Author SHA1 Message Date
Paul E. McKenney
6cdbc07a5a Merge branches 'consolidate.2019.04.09a', 'doc.2019.03.26b', 'fixes.2019.03.26b', 'srcu.2019.03.26b', 'stall.2019.03.26b' and 'torture.2019.03.26b' into HEAD
consolidate.2019.04.09a: Lingering RCU flavor consolidation cleanups.
doc.2019.03.26b: Documentation updates.
fixes.2019.03.26b: Miscellaneous fixes.
srcu.2019.03.26b: SRCU updates.
stall.2019.03.26b: RCU CPU stall warning updates.
torture.2019.03.26b: Torture-test updates.
2019-04-09 08:08:13 -07:00
Paul E. McKenney
a9d6938ddb locktorture: NULL cxt.lwsa and cxt.lrsa to allow bad-arg detection
Currently, lock_torture_cleanup() uses the values of cxt.lwsa and cxt.lrsa
to detect bad parameters that prevented locktorture from initializing,
let alone running.  In this case, lock_torture_cleanup() does no cleanup
aside from invoking torture_cleanup_begin() and torture_cleanup_end(),
as required to permit future torture tests to run.  However, this
heuristic fails if the run with bad parameters was preceded by a previous
run that actually ran:  In this case, both cxt.lwsa and cxt.lrsa will
remain non-zero, which means that the current lock_torture_cleanup()
invocation will be unable to detect the fact that it should skip cleanup,
which can result in charming outcomes such as double frees.

This commit therefore NULLs out both cxt.lwsa and cxt.lrsa at the end
of any run that actually ran.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Josh Triplett <josh@joshtriplett.org>
2019-03-26 14:42:53 -07:00
Paul E. McKenney
ad092c0277 rcuperf: Fix cleanup path for invalid perf_type strings
If the specified rcuperf.perf_type is not in the rcu_perf_init()
function's perf_ops[] array, rcuperf prints some console messages and
then invokes rcu_perf_cleanup() to set state so that a future torture
test can run.  However, rcu_perf_cleanup() also attempts to end the
test that didn't actually start, and in doing so relies on the value
of cur_ops, a value that is not particularly relevant in this case.
This can result in confusing output or even follow-on failures due to
attempts to use facilities that have not been properly initialized.

This commit therefore sets the value of cur_ops to NULL in this case and
inserts a check near the beginning of rcu_perf_cleanup(), thus avoiding
relying on an irrelevant cur_ops value.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:53 -07:00
Paul E. McKenney
b813afae7a rcutorture: Fix cleanup path for invalid torture_type strings
If the specified rcutorture.torture_type is not in the rcu_torture_init()
function's torture_ops[] array, rcutorture prints some console messages
and then invokes rcu_torture_cleanup() to set state so that a future
torture test can run.  However, rcu_torture_cleanup() also attempts to
end the test that didn't actually start, and in doing so relies on the
value of cur_ops, a value that is not particularly relevant in this case.
This can result in confusing output or even follow-on failures due to
attempts to use facilities that have not been properly initialized.

This commit therefore sets the value of cur_ops to NULL in this case
and inserts a check near the beginning of rcu_torture_cleanup(),
thus avoiding relying on an irrelevant cur_ops value.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:53 -07:00
Neeraj Upadhyay
d44ac1bebc rcutorture: Fix expected forward progress duration in OOM notifier
The rcutorture_oom_notify() function has a misplaced close parenthesis
that results in increasingly long delays in rcu_fwd_progress_check()'s
checking for various RCU forward-progress problems.  This commit therefore
puts the parenthesis in the right place.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:53 -07:00
Paul E. McKenney
f47cb1bb0d rcutorture: Remove ->ext_irq_conflict field
Back when there was a separate RCU-bh flavor, the ->ext_irq_conflict
field was used to prevent executing local_bh_enable() while interrupts
were disabled.  However, there is no longer an RCU-bh flavor, so this
commit removes the no-longer-needed ->ext_irq_conflict field.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:53 -07:00
Paul E. McKenney
a3b0e1e59e rcutorture: Make rcutorture_extend_mask() comment match the code
The code actually rarely uses more than one type of RCU read-side
protection, as is actually desired given that we need some reasonable
probability of preempting RCU read-side critical sections, which cannot
happen with multiple types of protection.  This comment therefore adjusts
the comment.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:53 -07:00
Paul E. McKenney
24aca4aea4 torture: Don't try to offline the last CPU
If there is only one online CPU, it doesn't make sense to try to offline
it, as any such attempt is guaranteed to fail.  This commit therefore
check for this condition and refuses to attempt the nonsensical.

Reported-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-By: Su Yue <suy.fnst@cn.fujitsu.com>
2019-03-26 14:42:53 -07:00
Neeraj Upadhyay
6c70e9cd5f rcu: Fix nohz status in stall warning
The Documentation/RCU/stallwarn.txt file says that stall warnings
print "D" if dyntick-idle processing is enabled, but the code in
print_cpu_stall_fast_no_hz() prints "." instead.  This commit therefore
reverses the sense of the test to make the code match the documentation.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:42:00 -07:00
Paul E. McKenney
b51bcbbf16 rcu: Move forward-progress checkers into tree_stall.h
This commit further consolidates stall-warning functionality by moving
forward-progress checkers into kernel/rcu/tree_stall.h, updating a
comment or two while in the area.  More specifically, this commit moves
show_rcu_gp_kthreads(), rcu_check_gp_start_stall(), rcu_fwd_progress_check(),
sysrq_rcu, sysrq_show_rcu(), sysrq_rcudump_op, and rcu_sysrq_init() from
kernel/rcu/tree.c to kernel/rcu/tree_stall.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
7ac1907c9e rcu: Move irq-disabled stall-warning checking to tree_stall.h
The rcu_iw_handler() function's sole purpose in life is to indicate
whether a stalled CPU had interrupts disabled, so it belongs in
kernel/rcu/tree_stall.h.  This commit therefore makes that move,
clarifying its header comment while in the area.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
e23344c2ca rcu: Organize functions in tree_stall.h
This commit does only code movement, removal of now-unneeded forward
declarations, and addition of comments.  It organizes the functions
that implement RCU CPU stall warnings for normal grace periods into
three categories:

1.	Control of RCU CPU stall warnings, including computing timeouts.

2.	Interaction of stall warnings with grace periods.

3.	Actual printing of the RCU CPU stall-warning messages.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
59b73a2768 rcu: Move FAST_NO_HZ stall-warning code to tree_stall.h
This commit further consolidates the stall-warning code by moving
print_cpu_stall_info() and its helper functions along with
zero_cpu_stall_ticks() to kernel/rcu/tree_stall.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
40e69ac7d0 rcu: Inline RCU stall-warning info helper functions
The print_cpu_stall_info_begin() and print_cpu_stall_info_end() print a
single character each onto the console, and are a holdover from a time
when RCU CPU stall warning messages could be abbreviated using a long-gone
Kconfig option.  This commit therefore adds these single characters to
already-printed strings in the calling functions, and then eliminates
both print_cpu_stall_info_begin() and print_cpu_stall_info_end().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
d87cda5094 rcu: Move rcu_print_task_exp_stall() to tree_exp.h
Because expedited CPU stall warnings are contained within the
kernel/rcu/tree_exp.h file, rcu_print_task_exp_stall() should live
there too.  This commit carries out the required code motion.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
21d0d79ab0 rcu: Inline RCU task stall-warning helper functions
The rcu_print_detail_task_stall(), rcu_print_task_stall_begin(), and
rcu_print_task_stall_end() functions were defined to allow long-gone
Kconfig options to provide an abbreviated RCU CPU stall warning printout.
This commit saves a few lines of code by inlining them into their sole
callers.

While in the area, a useless call of rcu_print_detail_task_stall_rnp()
on the root rcu_node structure was eliminated.  If there is only one
rcu_node structure, its tasks get printed twice, but if there are more,
the root rcu_node structure is guaranteed to have an empty list of blocked
tasks, hence the uselessness.  (Long ago, root rcu_node structures with
non-empty ->blkd_tasks lists could happen, but no longer.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
32255d51b6 rcu: Move RCU CPU stall-warning code out of tree.c
This commit completes the process of consolidating the code for RCU CPU
stall warnings for normal grace periods by moving the remaining such
code from kernel/rcu/tree.c to kernel/rcu/tree_stall.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
3fc3d1709f rcu: Move RCU CPU stall-warning code out of tree_plugin.h
The RCU CPU stall-warning code for normal grace periods is currently
scattered across two files, due to earlier Tiny RCU support for RCU
CPU stall warnings and for old Kconfig options that have long since
been retired.  Given that it is hard for the lead RCU maintainer to
find relevant stall-warning code, it would be good to consolidate it.
This commit continues this process by moving stall-warning code from
kernel/rcu/tree_plugin.c to a new kernel/rcu/tree_stall.h file.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
10462d6f58 rcu: Move RCU CPU stall-warning code out of update.c
The RCU CPU stall-warning code for normal grace periods is currently
scattered across three files, due to earlier Tiny RCU support for RCU
CPU stall warnings and for old Kconfig options that have long since
been retired.  Given that it is hard for the lead RCU maintainer to
find relevant stall-warning code, it would be good to consolidate it.
This commit starts this process by moving stall-warning code from
kernel/rcu/update.c to a new kernel/rcu/tree_stall.h file.

Note that the definitions of rcu_cpu_stall_suppress and
rcu_cpu_stall_timeout must remain in kernel/rcu/update.h to provide
compatibility for kernel boot parameter lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:40:13 -07:00
Paul E. McKenney
f5ad399149 srcu: Remove cleanup_srcu_struct_quiesced()
The cleanup_srcu_struct_quiesced() function was added because NVME
used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that
NVME workqueues waiting on SRCU workqueues could result in deadlocks
during low-memory conditions.  However, SRCU now also has WQ_MEM_RECLAIM
workqueues, so there is no longer a potential for deadlock.  Furthermore,
it turns out to be extremely hard to use cleanup_srcu_struct_quiesced()
correctly due to the fact that SRCU callback invocation accesses the
srcu_struct structure's per-CPU data area just after callbacks are
invoked.  Therefore, the usual practice of using srcu_barrier() to wait
for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced()
fails because SRCU's callback-invocation workqueue handler might be
delayed, which can result in cleanup_srcu_struct_quiesced() being invoked
(and thus freeing the per-CPU data) before the SRCU's callback-invocation
workqueue handler is finished using that per-CPU data.  Nor is this a
theoretical problem: KASAN emitted use-after-free warnings because of
this problem on actual runs.

In short, NVME can now safely invoke cleanup_srcu_struct(), which
avoids the use-after-free scenario.  And cleanup_srcu_struct_quiesced()
is quite difficult to use safely.  This commit therefore removes
cleanup_srcu_struct_quiesced(), switching its sole user back to
cleanup_srcu_struct().  This effectively reverts the following pair
of commits:

f7194ac32c ("srcu: Add cleanup_srcu_struct_quiesced()")
4317228ad9 ("nvme: Avoid flush dependency in delete controller flow")

Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
2019-03-26 14:39:24 -07:00
Paul E. McKenney
5cdfd174ea srcu: Check for in-flight callbacks in _cleanup_srcu_struct()
If someone fails to drain the corresponding SRCU callbacks (for
example, by failing to invoke srcu_barrier()) before invoking either
cleanup_srcu_struct() or cleanup_srcu_struct_quiesced(), the resulting
diagnostic is an ambiguous use-after-free diagnostic, and even then
only if you are running something like KASAN.  This commit therefore
improves SRCU diagnostics by adding checks for in-flight callbacks at
_cleanup_srcu_struct() time.

Note that these diagnostics can still be defeated, for example, by
invoking call_srcu() concurrently with cleanup_srcu_struct().  Which is
a really bad idea, but sometimes all too easy to do.  But even then,
these diagnostics have at least some probability of catching the problem.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
2019-03-26 14:39:24 -07:00
Paul E. McKenney
add0d37b4f rcu: Correct READ_ONCE()/WRITE_ONCE() for ->rcu_read_unlock_special
The task_struct structure's ->rcu_read_unlock_special field is only ever
read or written by the owning task, but it is accessed both at process
and interrupt levels.  It may therefore be accessed using plain reads
and writes while interrupts are disabled, but must be accessed using
READ_ONCE() and WRITE_ONCE() or better otherwise.  This commit makes a
few adjustments to align with this discipline.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Paul E. McKenney
f1a98045ab rcu: Fix typo in tree_exp.h comment
This commit changes a rcu_exp_handler() comment from rcu_preempt_defer_qs()
to rcu_preempt_deferred_qs() in order to better match reality.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Paul E. McKenney
a2badefa85 rcu: Eliminate redundant NULL-pointer check
Because rcu_wake_cond() checks for a null task_struct pointer, there is
no need for its callers to do so.  This commit eliminates the redundant
check.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Zhouyi Zhou
5d8a752e31 rcu: Fix force_qs_rnp() header comment
Previously, threads blocked on offlining CPUS were migrated to the
root rcu_node structure, thus requiring RCU priority boosting on this
structure.  However, since commit d19fb8d1f3 ("rcu: Don't migrate
blocked tasks even if all corresponding CPUs offline"), RCU does not
migrate blocked tasks.  Consequently, RCU no longer does RCU priority
boosting on the root rcu_node structure as of commit 1be0085b51 ("rcu:
Don't initiate RCU priority boosting on root rcu_node").

This commit therefore brings comments for the force_qs_rnp() function's
header comment in line with this new no-root-boosting reality.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
[ paulmck: Also remove obsolete comment on suppressing new grace periods. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Paul E. McKenney
85f2b60c43 rcu: Update jiffies_to_sched_qs and adjust_jiffies_till_sched_qs() comments
This commit better documents the jiffies_to_sched_qs default-value
strategy used by adjust_jiffies_till_sched_qs()

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Neeraj Upadhyay
6973032a60 rcu: Default jiffies_to_sched_qs to jiffies_till_sched_qs
The current code only calls adjust_jiffies_till_sched_qs() if
jiffies_till_sched_qs is left at its default value, so when the
jiffies_till_sched_qs kernel-boot parameter actually is specified,
jiffies_to_sched_qs will be left with the value zero, which
will result in useless slowdowns of cond_resched().  This commit
therefore changes rcu_init_geometry() to unconditionally invoke
adjust_jiffies_till_sched_qs(), which ensures that jiffies_to_sched_qs
will be initialized in all cases, thus maintaining good cond_resched()
performance.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:38:38 -07:00
Neeraj Upadhyay
0f58d2ac2c rcu: Fix self-wakeups for grace-period kthread
The current rcu_gp_kthread_wake() function uses in_interrupt()
and thus does a self-wakeup from all interrupt contexts, including
the pointless case where the GP kthread happens to be running with
bottom halves disabled, along with the impossible case where the GP
kthread is running within an NMI handler (you are not supposed to invoke
rcu_gp_kthread_wake() from within an NMI handler.  This commit therefore
replaces the in_interrupt() with in_irq(), so that the self-wakeups
happen only from handlers for hardware interrupts and softirqs.
This also makes the code match the comment.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-26 14:37:49 -07:00
Paul E. McKenney
497e42600b rcu: Report error for bad rcu_nocbs= parameter values
This commit prints a console message when cpulist_parse() reports a
bad list of CPUs, and sets all CPUs' bits in that case.  The reason for
setting all CPUs' bits is that this is the safe(r) choice for real-time
workloads, which would normally be the ones using the rcu_nocbs= kernel
boot parameter.  Either way, later RCU console log messages list the
actual set of CPUs whose RCU callbacks will be offloaded.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Paul E. McKenney
da8739f23f rcu: Allow rcu_nocbs= to specify all CPUs
Currently, the rcu_nocbs= kernel boot parameter requires that a specific
list of CPUs be specified, and has no way to say "all of them".
As noted by user RavFX in a comment to Phoronix topic 1002538, this
is an inconvenient side effect of the removal of the RCU_NOCB_CPU_ALL
Kconfig option.  This commit therefore enables the rcu_nocbs= kernel boot
parameter to be given the string "all", as in "rcu_nocbs=all" to specify
that all CPUs on the system are to have their RCU callbacks offloaded.

Another approach would be to make cpulist_parse() check for "all", but
there are uses of cpulist_parse() that do other checking, which could
conflict with an "all".  This commit therefore focuses on the specific
use of cpulist_parse() in rcu_nocb_setup().

Just a note to other people who would like changes to Linux-kernel RCU:
If you send your requests to me directly, they might get fixed somewhat
faster.  RavFX's comment was posted on January 22, 2018 and I first saw
it on March 5, 2019.  And the only reason that I found it -at- -all- was
that I was looking for projects using RCU, and my search engine showed
me that Phoronix comment quite by accident.  Your choice, though!  ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Akira Yokosawa
b2eb85b49a rcu: Move common code out of if-else block
As the result of recent addition of "rdp->core_needs_qs = false;" in
the "if" block, now both branches of the if-else have the same
assignment.

Factor it out and reduce line count.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-03-26 14:37:49 -07:00
Liu Song
3ffe3d1adc rcu: Set rcutree.kthread_prio sysfs access to read-only
The rcutree.kthread_prio kernel-boot parameter is used to set the
priority for boost (rcub), per-CPU (rcuc), and grace-period (rcu_preempt
or rcu_sched) kthreads.  It is also used by rcutorture to check whether
it is possible to meaningfully test RCU priority boosting.  However,
all of these cases will either ignore or be confused by any post-boot
changes to rcutree.kthread_prio.

Note that the user really can change the priorities of all of these
kthreads using chrt, given sufficient privileges.  Therefore, the
read-write nature of sysfs access to rcutree.kthread_prio is thus at
best an attractive nuisance.

This commit therefore changes sysfs access to rcutree.kthread_prio to
be read-only.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Paul E. McKenney
884157cef0 rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section.  It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur.  This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task.  The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.

Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case.  This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.

Note that deferred quiescent states need not be considered.  The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.

Note also that negative values of ->rcu_read_lock_nesting need not be
considered.  First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.

Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting.  Furthermore, there have been no reports of the bug
fixed by this commit appearing in production.  This commit is therefore
absolutely -not- recommended for backporting to -stable.

Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-03-26 14:37:49 -07:00
Cyrill Gorcunov
18d7e40679 rcu: rcu_qs -- Use raise_softirq_irqoff to not save irqs twice
The rcu_qs is disabling IRQs by self so no need to do the same in raise_softirq
but instead we can save some cycles using raise_softirq_irqoff directly.

CC: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Joel Fernandes (Google)
671a63517c rcu: Avoid unnecessary softirq when system is idle
When there are no callbacks pending on an idle system, I noticed that
RCU softirq is continuously firing. During this the cpu_no_qs is set to
false, and core_needs_qs is set to true indefinitely. This causes
rcu_process_callbacks to be repeatedly called, even though the node
corresponding to the CPU has that CPU's mask bit cleared and the system
is idle. I believe the race is when such mask clearing is done during
idle CPU scan of the quiescent state forcing stage in the kthread
instead of the softirq. Since the rnp mask is cleared, but the flags on
the CPU's rdp are not cleared, the CPU thinks it still needs to report
to core RCU.

Cure this by clearing the core_needs_qs flag when the CPU detects that
its node is already updated which will avoid the unwanted softirq raises
to the benefit of real-time systems.

Test: Ran rcutorture for various tree RCU configs.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Paul E. McKenney
e85e6a21b2 rcu: Unconditionally expedite during suspend/hibernate
The rcu_pm_notify() function refuses to switch to/from expedited grace
periods on systems with more than 256 CPUs due to the serialized
initialization of expedited grace periods.  However, expedited grace
periods are now initialized in parallel, removing this concern.
This commit therefore removes the checks from rcu_pm_notify(), so that
expedited grace periods are used unconditionally during suspend/resume
and hibernate/wake operations.

As always, real-time workloads wishing to completely avoid expedited
grace periods can use the rcupdate.rcu_normal= kernel parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-26 14:37:49 -07:00
Linus Torvalds
a9dce6679d Merge tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd system call from Christian Brauner:
 "This introduces the ability to use file descriptors from /proc/<pid>/
  as stable handles on struct pid. Even if a pid is recycled the handle
  will not change. For a start these fds can be used to send signals to
  the processes they refer to.

  With the ability to use /proc/<pid> fds as stable handles on struct
  pid we can fix a long-standing issue where after a process has exited
  its pid can be reused by another process. If a caller sends a signal
  to a reused pid it will end up signaling the wrong process.

  With this patchset we enable a variety of use cases. One obvious
  example is that we can now safely delegate an important part of
  process management - sending signals - to processes other than the
  parent of a given process by sending file descriptors around via scm
  rights and not fearing that the given process will have been recycled
  in the meantime. It also allows for easy testing whether a given
  process is still alive or not by sending signal 0 to a pidfd which is
  quite handy.

  There has been some interest in this feature e.g. from systems
  management (systemd, glibc) and container managers. I have requested
  and gotten comments from glibc to make sure that this syscall is
  suitable for their needs as well. In the future I expect it to take on
  most other pid-based signal syscalls. But such features are left for
  the future once they are needed.

  This has been sitting in linux-next for quite a while and has not
  caused any issues. It comes with selftests which verify basic
  functionality and also test that a recycled pid cannot be signaled via
  a pidfd.

  Jon has written about a prior version of this patchset. It should
  cover the basic functionality since not a lot has changed since then:

      https://lwn.net/Articles/773459/

  The commit message for the syscall itself is extensively documenting
  the syscall, including it's functionality and extensibility"

* tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add tests for pidfd_send_signal()
  signal: add pidfd_send_signal() syscall
2019-03-16 13:47:14 -07:00
Linus Torvalds
f67e3fb489 Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
 "New device-dax infrastructure to allow persistent memory and other
  "reserved" / performance differentiated memories, to be assigned to
  the core-mm as "System RAM".

  Some users want to use persistent memory as additional volatile
  memory. They are willing to cope with potential performance
  differences, for example between DRAM and 3D Xpoint, and want to use
  typical Linux memory management apis rather than a userspace memory
  allocator layered over an mmap() of a dax file. The administration
  model is to decide how much Persistent Memory (pmem) to use as System
  RAM, create a device-dax-mode namespace of that size, and then assign
  it to the core-mm. The rationale for device-dax is that it is a
  generic memory-mapping driver that can be layered over any "special
  purpose" memory, not just pmem. On subsequent boots udev rules can be
  used to restore the memory assignment.

  One implication of using pmem as RAM is that mlock() no longer keeps
  data off persistent media. For this reason it is recommended to enable
  NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
  at rest. We considered making this recommendation an actively enforced
  requirement, but in the end decided to leave it as a distribution /
  administrator policy to allow for emulation and test environments that
  lack security capable NVDIMMs.

  Summary:

   - Replace the /sys/class/dax device model with /sys/bus/dax, and
     include a compat driver so distributions can opt-in to the new ABI.

   - Allow for an alternative driver for the device-dax address-range

   - Introduce the 'kmem' driver to hotplug / assign a device-dax
     address-range to the core-mm.

   - Arrange for the device-dax target-node to be onlined so that the
     newly added memory range can be uniquely referenced by numa apis"

NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.

And apparently the PMEM modules can cause that a lot more than regular
RAM.  The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.

Quoting Dan from another email:
 "The exposure can be reduced in the volatile-RAM case by scanning for
  and clearing errors before it is onlined as RAM. The userspace tooling
  for that can be in place before v5.1-final. There's also runtime
  notifications of errors via acpi_nfit_uc_error_notify() from
  background scrubbers on the DIMM devices. With that mechanism the
  kernel could proactively clear newly discovered poison in the volatile
  case, but that would be additional development more suitable for v5.2.

  I understand the concern, and the need to highlight this issue by
  tapping the brakes on feature development, but I don't see PMEM as RAM
  making the situation worse when the exposure is also there via DAX in
  the PMEM case. Volatile-RAM is arguably a safer use case since it's
  possible to repair pages where the persistent case needs active
  application coordination"

* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: "Hotplug" persistent memory for use like normal RAM
  mm/resource: Let walk_system_ram_range() search child resources
  mm/memory-hotplug: Allow memory resources to be children
  mm/resource: Move HMM pr_debug() deeper into resource code
  mm/resource: Return real error codes from walk failures
  device-dax: Add a 'modalias' attribute to DAX 'bus' devices
  device-dax: Add a 'target_node' attribute
  device-dax: Auto-bind device after successful new_id
  acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
  device-dax: Add /sys/class/dax backwards compatibility
  device-dax: Add support for a dax override driver
  device-dax: Move resource pinning+mapping into the common driver
  device-dax: Introduce bus + driver model
  device-dax: Start defining a dax bus model
  device-dax: Remove multi-resource infrastructure
  device-dax: Kill dax_region base
  device-dax: Kill dax_region ida
2019-03-16 13:05:32 -07:00
Linus Torvalds
11efae3506 Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block
Pull more block layer changes from Jens Axboe:
 "This is a collection of both stragglers, and fixes that came in after
  I finalized the initial pull. This contains:

   - An MD pull request from Song, with a few minor fixes

   - Set of NVMe patches via Christoph

   - Pull request from Konrad, with a few fixes for xen/blkback

   - pblk fix IO calculation fix (Javier)

   - Segment calculation fix for pass-through (Ming)

   - Fallthrough annotation for blkcg (Mathieu)"

* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
  blkcg: annotate implicit fall through
  nvme-tcp: support C2HData with SUCCESS flag
  nvmet: ignore EOPNOTSUPP for discard
  nvme: add proper write zeroes setup for the multipath device
  nvme: add proper discard setup for the multipath device
  nvme: remove nvme_ns_config_oncs
  nvme: disable Write Zeroes for qemu controllers
  nvmet-fc: bring Disconnect into compliance with FC-NVME spec
  nvmet-fc: fix issues with targetport assoc_list list walking
  nvme-fc: reject reconnect if io queue count is reduced to zero
  nvme-fc: fix numa_node when dev is null
  nvme-fc: use nr_phys_segments to determine existence of sgl
  nvme-loop: init nvmet_ctrl fatal_err_work when allocate
  nvme: update comment to make the code easier to read
  nvme: put ns_head ref if namespace fails allocation
  nvme-trace: fix cdw10 buffer overrun
  nvme: don't warn on block content change effects
  nvme: add get-feature to admin cmds tracer
  md: Fix failed allocation of md_register_thread
  It's wrong to add len to sector_nr in raid10 reshape twice
  ...
2019-03-16 12:36:39 -07:00
Linus Torvalds
aa2e3ac64a Merge tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes and cleanups from Steven Rostedt:
 "This contains a series of last minute clean ups, small fixes and error
  checks"

* tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/probe: Verify alloc_trace_*probe() result
  tracing/probe: Check event/group naming rule at parsing
  tracing/probe: Check the size of argument name and body
  tracing/probe: Check event name length correctly
  tracing/probe: Check maxactive error cases
  tracing: kdb: Fix ftdump to not sleep
  trace/probes: Remove kernel doc style from non kernel doc comment
  tracing/probes: Make reserved_field_names static
2019-03-15 14:47:02 -07:00
Linus Torvalds
2b9c272cf5 Merge tag 'fbdev-v5.1' of git://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
 "Just a couple of small fixes and cleanups:

   - fix memory access if logo is bigger than the screen (Manfred
     Schlaegl)

   - silence fbcon logo on 'quiet' boots (Prarit Bhargava)

   - use kvmalloc() for scrollback buffer in fbcon (Konstantin Khorenko)

   - misc fixes (Colin Ian King, YueHaibing, Matteo Croce, Mathieu
     Malaterre, Anders Roxell, Arnd Bergmann)

   - misc cleanups (Rob Herring, Lubomir Rintel, Greg Kroah-Hartman,
     Jani Nikula, Michal Vokáč)"

* tag 'fbdev-v5.1' of git://github.com/bzolnier/linux:
  fbdev: mbx: fix a misspelled variable name
  fbdev: omap2: fix warnings in dss core
  video: fbdev: Fix potential NULL pointer dereference
  fbcon: Silence fbcon logo on 'quiet' boots
  printk: Export console_printk
  ARM: dts: imx28-cfa10036: Fix the reset gpio signal polarity
  video: ssd1307fb: Do not hard code active-low reset sequence
  dt-bindings: display: ssd1307fb: Remove reset-active-low from examples
  fbdev: fbmem: fix memory access if logo is bigger than the screen
  video/fbdev: refactor video= cmdline parsing
  fbdev: mbx: fix up debugfs file creation
  fbdev: omap2: no need to check return value of debugfs_create functions
  video: fbdev: geode: remove ifdef OLPC noise
  video: offb: annotate implicit fall throughs
  omapfb: fix typo
  fbdev: Use of_node_name_eq for node name comparisons
  fbcon: use kvmalloc() for scrollback buffer
  fbdev: chipsfb: remove set but not used variable 'size'
  fbdev/via: fix spelling mistake "Expandsion" -> "Expansion"
2019-03-15 14:22:59 -07:00
Masami Hiramatsu
a039480e9e tracing/probe: Verify alloc_trace_*probe() result
Since alloc_trace_*probe() returns -EINVAL only if !event && !group,
it should not happen in trace_*probe_create(). If we catch that case
there is a bug. So use WARN_ON_ONCE() instead of pr_info().

Link: http://lkml.kernel.org/r/155253785078.14922.16902223633734601469.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:54:21 -04:00
Masami Hiramatsu
5b7a962209 tracing/probe: Check event/group naming rule at parsing
Check event and group naming rule at parsing it instead
of allocating probes.

Link: http://lkml.kernel.org/r/155253784064.14922.2336893061156236237.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:54:11 -04:00
Masami Hiramatsu
b4443c17a3 tracing/probe: Check the size of argument name and body
Check the size of argument name and expression is not 0
and smaller than maximum length.

Link: http://lkml.kernel.org/r/155253783029.14922.12650939303827581096.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:53:57 -04:00
Masami Hiramatsu
dec65d79fd tracing/probe: Check event name length correctly
Ensure given name of event is not too long when parsing it,
and fix to update event name offset correctly when the group
name is given. For example, this makes probe event to check
the "p:foo/" error case correctly.

Link: http://lkml.kernel.org/r/155253782046.14922.14724124823730168629.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:53:47 -04:00
Masami Hiramatsu
287c038c0b tracing/probe: Check maxactive error cases
Check maxactive on kprobe error case, because maxactive
is only for kretprobe, not for kprobe. Also, maxactive
should not be 0, it should be at least 1.

Link: http://lkml.kernel.org/r/155253780952.14922.15784129810238750331.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-14 19:53:29 -04:00
Mathieu Malaterre
f6d85f04e2 blkcg: annotate implicit fall through
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

This commit remove the following warning:

  kernel/trace/blktrace.c:725:9: warning: this statement may fall through [-Wimplicit-fallthrough=]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 14:31:12 -06:00
Douglas Anderson
31b265b3ba tracing: kdb: Fix ftdump to not sleep
As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
BUG for "sleeping function called from invalid context".

kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
atomic context.  A very simple solution for this is to add allocation
flags to ring_buffer_read_prepare() so kdb can call it without
triggering the allocation error.  This patch does that.

Note that in the original email thread about this, it was suggested
that perhaps the solution for kdb was to either preallocate the buffer
ahead of time or create our own iterator.  I'm hoping that this
alternative of adding allocation flags to ring_buffer_read_prepare()
can be considered since it means I don't need to duplicate more of the
core trace code into "trace_kdb.c" (for either creating my own
iterator or re-preparing a ring allocator whose memory was already
allocated).

NOTE: another option for kdb is to actually figure out how to make it
reuse the existing ftrace_dump() function and totally eliminate the
duplication.  This sounds very appealing and actually works (the "sr
z" command can be seen to properly dump the ftrace buffer).  The
downside here is that ftrace_dump() fully consumes the trace buffer.
Unless that is changed I'd rather not use it because it means "ftdump
| grep xyz" won't be very useful to search the ftrace buffer since it
will throw away the whole trace on the first grep.  A future patch to
dump only the last few lines of the buffer will also be hard to
implement.

[1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com

Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org

Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-13 09:46:10 -04:00
Linus Torvalds
7b47a9e7c8 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount infrastructure updates from Al Viro:
 "The rest of core infrastructure; no new syscalls in that pile, but the
  old parts are switched to new infrastructure. At that point
  conversions of individual filesystems can happen independently; some
  are done here (afs, cgroup, procfs, etc.), there's also a large series
  outside of that pile dealing with NFS (quite a bit of option-parsing
  stuff is getting used there - it's one of the most convoluted
  filesystems in terms of mount-related logics), but NFS bits are the
  next cycle fodder.

  It got seriously simplified since the last cycle; documentation is
  probably the weakest bit at the moment - I considered dropping the
  commit introducing Documentation/filesystems/mount_api.txt (cutting
  the size increase by quarter ;-), but decided that it would be better
  to fix it up after -rc1 instead.

  That pile allows to do followup work in independent branches, which
  should make life much easier for the next cycle. fs/super.c size
  increase is unpleasant; there's a followup series that allows to
  shrink it considerably, but I decided to leave that until the next
  cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
  afs: Use fs_context to pass parameters over automount
  afs: Add fs_context support
  vfs: Add some logging to the core users of the fs_context log
  vfs: Implement logging through fs_context
  vfs: Provide documentation for new mount API
  vfs: Remove kern_mount_data()
  hugetlbfs: Convert to fs_context
  cpuset: Use fs_context
  kernfs, sysfs, cgroup, intel_rdt: Support fs_context
  cgroup: store a reference to cgroup_ns into cgroup_fs_context
  cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
  cgroup_do_mount(): massage calling conventions
  cgroup: stash cgroup_root reference into cgroup_fs_context
  cgroup2: switch to option-by-option parsing
  cgroup1: switch to option-by-option parsing
  cgroup: take options parsing into ->parse_monolithic()
  cgroup: fold cgroup1_mount() into cgroup1_get_tree()
  cgroup: start switching to fs_context
  ipc: Convert mqueue fs to fs_context
  proc: Add fs_context support to procfs
  ...
2019-03-12 14:08:19 -07:00
Linus Torvalds
5f739e4a49 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted fixes (really no common topic here)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Make __vfs_write() static
  vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
  pipe: stop using ->can_merge
  splice: don't merge into linked buffers
  fs: move generic stat response attr handling to vfs_getattr_nosec
  orangefs: don't reinitialize result_mask in ->getattr
  fs/devpts: always delete dcache dentry-s in dput()
2019-03-12 13:27:20 -07:00