Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.
If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wait an uninterruptible waiter.
The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.
Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime).
It makes use of a per-CPU hrtimer resource and hence arming that
hrtimer should be based on total SCHED_FAIR tasks a CPU has across its
various cfs_rqs, rather than being based on number of tasks in a
particular cfs_rq (as implemented currently).
As a result, with current code, its possible for a running task (which
is the sole task in its cfs_rq) to be preempted much after its
ideal_runtime has elapsed, resulting in increased latency for tasks in
other cfs_rq on same CPU.
Fix this by arming sched hrtimer based on total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1474075731-11550-1-git-send-email-joonwoop@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The dl task will be replenished after dl task timer fire and start a
new period. It will be enqueued and to re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue will splash since the target CPU is offline.
As a result we get:
WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
Call Trace:
dump_stack+0x99/0xd0
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
irq_work_queue_on+0xad/0xe0
tick_nohz_full_kick_cpu+0x44/0x50
tick_nohz_dep_set_cpu+0x74/0xb0
enqueue_task_dl+0x226/0x480
activate_task+0x5c/0xa0
dl_task_timer+0x19b/0x2c0
? push_dl_task.part.31+0x190/0x190
This can be triggered by hot-unplugging the full dynticks CPU which dl
task is running on.
We enqueue the dl task on the offline CPU, because we need to do
replenish for start_dl_timer(). So, as Juri pointed out, we would
need to do is calling replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem as the task shouldn't
be boosted if it was throttled.
This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by
first migrating (set_task_cpu()) and then doing 1 enqueue.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.
The task being woken up is already awake from a schedule()
and is doing the following:
do {
schedule()
set_current_state(TASK_(UN)INTERRUPTIBLE);
} while (!cond);
The waker, actually gets stuck doing the following in
try_to_wake_up():
while (p->on_cpu)
cpu_relax();
Analysis:
The instance I've seen involves the following race:
CPU1 CPU2
while () {
if (cond)
break;
do {
schedule();
set_current_state(TASK_UN..)
} while (!cond);
wakeup_routine()
spin_lock_irqsave(wait_lock)
raw_spin_lock_irqsave(wait_lock) wake_up_process()
} try_to_wake_up()
set_current_state(TASK_RUNNING); ..
list_del(&waiter.list);
CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:
CPU3
wakeup_routine()
raw_spin_lock_irqsave(wait_lock)
if (!list_empty)
wake_up_process()
try_to_wake_up()
raw_spin_lock_irqsave(p->pi_lock)
..
if (p->on_rq && ttwu_wakeup())
..
while (p->on_cpu)
cpu_relax()
..
CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.
CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.
The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
Reproduction of the issue:
The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.
Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
architectures do not have a full barrier in switch_to()
so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
d670ec1317 ("posix-cpu-timers: Cure SMP wobbles")
started accounting thread group tasks pending runtime in thread_group_cputime().
Another commit:
6e998916df ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
updated scheduler runtime statistics (call update_curr()) when reading task pending
runtime. Those changes cause bad performance of SYS_times() and
SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on
larger systems with many CPUs.
While we would like to have cpuclock monotonicity kept i.e. have
problems fixed by above commits stay fixed, we also would like to have
good performance.
However when we notice that change from commit d670ec1317 is not
longer needed to solve problem addressed by that commit, because of
change from the second commit 6e998916df, we can get room for
optimization. Since we update task while reading it's pending runtime
in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will
see updated values and on testcase from d670ec1317 process cpuclock
will not be smaller than thread cpuclock.
I tested the patch on testcases from commits d670ec1317,
6e998916df and some other cpuclock/cputimers testcases and
did not found cpuclock monotonicity problems or other malfunction.
This patch has the drawback that we will not provide thread group cputime
up-to-date to the last moment. For example when arming cputime timer,
we will arm it with possibly a bit outdated values and that timer will
trigger earlier compared to behaviour without the patch. However that
was the behaviour before d670ec1317 commit (kernel v3.1) so it's
unlikely to affect applications.
Patch improves related syscall performance, as measured by Giovanni's
benchmarks described in commit:
6075620b05 ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")
The benchmark results are:
SYS_clock_gettime():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch
(pre-6e998916dfe3)
2 3.48 2.23 ( 35.68%) 3.06 ( 11.83%) 1.08 ( 68.81%)
5 3.33 2.83 ( 14.84%) 3.25 ( 2.40%) 0.71 ( 78.55%)
8 3.37 2.84 ( 15.80%) 3.26 ( 3.30%) 0.56 ( 83.49%)
12 3.32 3.09 ( 6.69%) 3.37 ( -1.60%) 0.42 ( 87.28%)
21 4.01 3.14 ( 21.70%) 3.90 ( 2.74%) 0.35 ( 91.35%)
30 3.63 3.28 ( 9.75%) 3.36 ( 7.41%) 0.28 ( 92.23%)
48 3.71 3.02 ( 18.69%) 3.11 ( 16.27%) 0.39 ( 89.39%)
79 3.75 2.88 ( 23.23%) 3.16 ( 15.74%) 0.46 ( 87.76%)
110 3.81 2.95 ( 22.62%) 3.25 ( 14.80%) 0.56 ( 85.41%)
128 3.88 3.05 ( 21.28%) 3.31 ( 14.76%) 0.62 ( 84.10%)
SYS_times():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch
(pre-6e998916dfe3)
2 3.65 2.27 ( 37.94%) 3.25 ( 11.03%) 1.62 ( 55.71%)
5 3.45 2.78 ( 19.34%) 3.17 ( 7.92%) 2.33 ( 32.28%)
8 3.52 2.79 ( 20.66%) 3.22 ( 8.69%) 2.06 ( 41.44%)
12 3.29 3.02 ( 8.33%) 3.36 ( -2.04%) 2.00 ( 39.18%)
21 4.07 3.10 ( 23.86%) 3.92 ( 3.78%) 2.07 ( 49.18%)
30 3.87 3.33 ( 13.80%) 3.40 ( 12.17%) 1.89 ( 51.12%)
48 3.79 2.96 ( 21.94%) 3.16 ( 16.61%) 1.69 ( 55.46%)
79 3.88 2.88 ( 25.82%) 3.28 ( 15.42%) 1.60 ( 58.81%)
110 3.90 2.98 ( 23.73%) 3.38 ( 13.35%) 1.73 ( 55.61%)
128 4.00 3.10 ( 22.40%) 3.38 ( 15.45%) 1.66 ( 58.52%)
Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of CPU capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).
The flag is intended to indicate that extra care should be taken when
placing tasks on CPUs and this level spans all the different types of
CPUs found in the system (no need to look further up the domain
hierarchy). This information is currently only available through
iterating through the capacities of all the CPUs at parent levels in the
sched_domain hierarchy.
SD 2 [ 0 1 2 3] SD_ASYM_CPUCAPACITY
SD 1 [ 0 1] [ 2 3] !SD_ASYM_CPUCAPACITY
CPU: 0 1 2 3
capacity: 756 756 1024 1024
If the topology in the example above is duplicated to create an eight
CPU example with third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two group
would both have all CPU capacities represented within them.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
5743021831 ("sched/cputime: Count actually elapsed irq & softirq time")
... fixed a bug but also triggered a regression:
On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
on host one by one until there is only one left, then observe CPU utilization
via 'top' in the guest, it shows:
100% st for cpu0(housekeeping)
75% st for other CPUs (nohz full mode)
However, w/o this commit it shows the correct 75% for all four CPUs.
When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.
However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() it is
subtracted from is, so the max cputime limit is required to avoid underflow.
This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.
Suggested-by: Rik van Riel <riel@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reports:
Roughly 10% of the time, ltp testcase getrusage04 fails:
getrusage04 0 TINFO : Expected timers granularity is 4000 us
getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
getrusage04 0 TINFO : utime: 0us; stime: 179us
getrusage04 0 TINFO : utime: 3751us; stime: 0us
getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:
And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.
Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org # 4.3+
Fixes: 9d7fb04276 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This message is currently really useless since it always prints a value
that comes from the printk() we just did, e.g.:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
BUG: sleeping function called from invalid context at include/linux/freezer.h:56
in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
Here, both down_trylock() and console_unlock() is somewhere in the
printk() path.
We should save the value before calling printk() and use the saved value
instead. That immediately reveals the offending callsite:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
Bug report:
http://marc.info/?l=linux-netdev&m=146925979821849&w=2
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russel <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
setup_new_dl_entity() takes two parameters, but it only actually uses
one of them, under a different name, to setup a new dl_entity, after:
2f9f3fdc928 "sched/deadline: Remove dl_new from struct sched_dl_entity"
as we currently do:
setup_new_dl_entity(&p->dl, &p->dl)
However, before Luca's change we were doing:
setup_new_dl_entity(dl_se, pi_se)
in update_dl_entity() for a dl_se->new entity: we were using pi_se's
parameters (the potential PI donor) for setting up a new entity.
This change removes the useless second parameter of setup_new_dl_entity().
While we are at it we also optimize things further calling setup_new_dl_
entity() only for already queued tasks, since (as pointed out by Xunlei)
we already do the very same update at tasks wakeup time anyway. By doing
so, we don't need to worry about a potential PI donor anymore, as
rt_mutex_setprio() takes care of that already for us.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xunlei Pang <xpang@redhat.com>
Link: http://lkml.kernel.org/r/1470409675-20935-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent noted that the update_tg_load_avg() usage in commit:
3d30544f02 ("sched/fair: Apply more PELT fixes")
isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().
Also, add more comments.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:
WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
releasing a pinned lock
Call Trace:
dump_stack+0x99/0xd0
__warn+0xd1/0xf0
? dl_task_timer+0x1a1/0x2b0
warn_slowpath_fmt+0x4f/0x60
? sched_clock+0x13/0x20
lock_release+0x690/0x6a0
? enqueue_pushable_dl_task+0x9b/0xa0
? enqueue_task_dl+0x1ca/0x480
_raw_spin_unlock+0x1f/0x40
dl_task_timer+0x1a1/0x2b0
? push_dl_task.part.31+0x190/0x190
WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
unpinning an unpinned lock
Call Trace:
dump_stack+0x99/0xd0
__warn+0xd1/0xf0
warn_slowpath_fmt+0x4f/0x60
lock_unpin_lock+0x181/0x1a0
dl_task_timer+0x127/0x2b0
? push_dl_task.part.31+0x190/0x190
As per the comment before this code, its safe to drop the RQ lock
here, and since we (potentially) change rq, unpin and repin to avoid
the splat.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
6e998916df ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
classes update_curr() when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performace loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percents for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_clock_gettime and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettimes(). The results above can be
reproduced with cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine; Once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with:
$ cd mmtests/work/log
$ ../../compare-kernels.sh
the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clairity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
interested execution path.
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916df
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads parent of 6e998916df 4.7-rc7
6e998916df itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads parent of 6e998916df 4.7-rc7
6e998916df itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
done
==== BEGIN pound_times.c ====
struct tms start;
void *pound (void *threadid)
{
struct tms end;
int oldutime = 0;
int utime;
int i;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long i;
times(&start);
for (i = 0; i < NUM_THREADS; i++) {
pthread_create (&th[i], NULL, pound, (void *)i);
}
pthread_exit(NULL);
return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
void *pound (void *threadid)
{
struct timespec ts;
int rc, i;
unsigned long prev = 0, this = 0;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
if (rc < 0)
perror("clock_gettime");
this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
if (0 && this < prev)
printf("%lu ns timewarp at iteration %d\n", prev - this, i);
prev = this;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long rc, i;
pid_t pgid;
for (i = 0; i < NUM_THREADS; i++) {
rc = pthread_create(&th[i], NULL, pound, (void *)i);
if (rc < 0)
perror("pthread_create");
}
pthread_exit(NULL);
return 0;
}
==== END pound_clock_gettime.c ====
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>