Since commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() ...")
sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
but is only more loaded than others. This change has been introduced to ensure
a better load balance in system that are not overloaded but as a side effect,
it can also generate useless active migration between groups.
Let take the example of 3 tasks on a quad cores system. We will always have an
idle core so the load balance will find a busiest group (core) whenever an ILB
is triggered and it will force an active migration (once above
nr_balance_failed threshold) so the idle core becomes busy but another core
will become idle. With the next ILB, the freshly idle core will try to pull the
task of a busy CPU.
The number of spurious active migration is not so huge in quad core system
because the ILB is not triggered so much. But it becomes significant as soon as
you have more than one sched_domain level like on a dual cluster of quad cores
where the ILB is triggered every tick when you have more than 1 busy_cpu
We need to ensure that the migration generate a real improveùent and will not
only move the avg_load imbalance on another CPU.
Before caeb178c60, the filtering of such use
case was ensured by the following test in f_b_g:
if ((local->idle_cpus < busiest->idle_cpus) &&
busiest->sum_nr_running <= busiest->group_weight)
This patch modified the condition to take into account situation where busiest
group is not overloaded: If the diff between the number of idle cpus in 2
groups is less than or equal to 1 and the busiest group is not overloaded,
moving a task will not improve the load balance but just move it.
A test with sysbench on a dual clusters of quad cores gives the following
results:
command: sysbench --test=cpu --num-threads=5 --max-time=5 run
The HZ is 200 which means that 1000 ticks has fired during the test.
With Mainline, perf gives the following figures:
Samples: 727 of event 'sched:sched_migrate_task'
Event count (approx.): 727
Overhead Command Shared Object Symbol
........ ............... ............. ..............
12.52% migration/1 [unknown] [.] 00000000
12.52% migration/5 [unknown] [.] 00000000
12.52% migration/7 [unknown] [.] 00000000
12.10% migration/6 [unknown] [.] 00000000
11.83% migration/0 [unknown] [.] 00000000
11.83% migration/3 [unknown] [.] 00000000
11.14% migration/4 [unknown] [.] 00000000
10.87% migration/2 [unknown] [.] 00000000
2.75% sysbench [unknown] [.] 00000000
0.83% swapper [unknown] [.] 00000000
0.55% ktps65090charge [unknown] [.] 00000000
0.41% mmcqd/1 [unknown] [.] 00000000
0.14% perf [unknown] [.] 00000000
With this patch, perf gives the following figures
Samples: 20 of event 'sched:sched_migrate_task'
Event count (approx.): 20
Overhead Command Shared Object Symbol
........ ............... ............. ..............
80.00% sysbench [unknown] [.] 00000000
10.00% swapper [unknown] [.] 00000000
5.00% ktps65090charge [unknown] [.] 00000000
5.00% migration/1 [unknown] [.] 00000000
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS. It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:
161270fc1f ("x86/smp: Fix topology checks on AMD MCM CPUs")
Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.
AMD special-cased their system by looking for a cpuid flag. The
Intel mode is dependent on BIOS options and I do not know of a
way which it is enumerated other than the tables being parsed
during the CPU bringup process. In other words, we have to trust
the ACPI tables <shudder>.
This detects the situation where a NUMA node occurs at a place in
the middle of the "CPU" sched domains. It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.
This also fixes a sysfs bug. We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group. MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'. This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:
core_siblings: internal kernel map of cpu#'s hardware threads
within the same physical_package_id.
core_siblings_list: human-readable list of the logical CPU
numbers within the same physical_package_id as cpu#.
The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.
Before this patch, there are two packages:
# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
18 0
18 1
But 4 _sets_ of core siblings:
# cat cpu*/topology/core_siblings_list | sort | uniq -c
9 0-8
9 18-26
9 27-35
9 9-17
After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.
# cat cpu*/topology/physical_package_id | sort | uniq -c
18 0
18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
18 0-17
18 18-35
Example spew:
...
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
#2#3#4#5#6#7#8
.... node #1, CPUs: #9
------------[ cut here ]------------
WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Modules linked in:
CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
Call Trace:
[<ffffffff8172e485>] dump_stack+0x45/0x56
[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
[<ffffffff8107568d>] start_secondary+0x1ad/0x240
---[ end trace 3fe5f587a9fcde61 ]---
#10#11#12#13#14#15#16#17
.... node #2, CPUs: #18#19#20#21#22#23#24#25#26
.... node #3, CPUs: #27#28#29#30#31#32#33#34#35
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Added LLC domain and s/match_mc/match_die/ ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The code in find_idlest_cpu() looks for the CPU with the smallest load.
However, if multiple CPUs are idle, the first idle CPU is selected
irrespective of the depth of its idle state.
Among the idle CPUs we should pick the one with with the shallowest idle
state, or the latest to have gone idle if all idle CPUs are in the same
state. The later applies even when cpuidle is configured out.
This patch doesn't cover the following issues:
- The idle exit latency of a CPU might be larger than the time needed
to migrate the waking task to an already running CPU with sufficient
capacity, and therefore performance would benefit from task packing
in such case (in most cases task packing is about power saving).
- Some idle states have a non negligible and non abortable entry latency
which needs to run to completion before the exit latency can start.
A concurrent patch series is making this info available to the cpuidle
core. Once available, the entry latency with the idle timestamp could
determine when the exit latency may be effective.
Those issues will be handled in due course. In the mean time, what
is implemented here should improve things already compared to the current
state of affairs.
Based on an initial patch from Daniel Lezcano.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the cpu enters idle, it stores the cpuidle state pointer in its
struct rq instance which in turn could be used to make a better decision
when balancing tasks.
As soon as the cpu exits its idle state, the struct rq reference is
cleared.
There are a couple of situations where the idle state pointer could be changed
while it is being consulted:
1. For x86/acpi with dynamic c-states, when a laptop switches from battery
to AC that could result on removing the deeper idle state. The acpi driver
triggers:
'acpi_processor_cst_has_changed'
'cpuidle_pause_and_lock'
'cpuidle_uninstall_idle_handler'
'kick_all_cpus_sync'.
All cpus will exit their idle state and the pointed object will be set to
NULL.
2. The cpuidle driver is unloaded. Logically that could happen but not
in practice because the drivers are always compiled in and 95% of them are
not coded to unregister themselves. In any case, the unloading code must
call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
leading to 'kick_all_cpus_sync' as mentioned above.
A race can happen if we use the pointer and then one of these two scenarios
occurs at the same moment.
In order to be safe, the idle state pointer stored in the rq must be
used inside a rcu_read_lock section where we are protected with the
'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
idle_get_state() and idle_put_state() accessors should be used to that
effect.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In wake_affine() I have tried to understand the meaning of the condition:
(this_load <= load &&
this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it and I haven't
found clear description in the previous commit's log.
Futhermore, the comment of the condition refers to the task_hot function that
was used before being replaced by the current condition:
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
* there is no bad imbalance.
*/
If we look more deeply the below condition:
this_load + target_load(prev_cpu, idx) <= tl_per_task
When sync is clear, we have:
tl_per_task = runnable_load_avg / nr_running
this_load = max(runnable_load_avg, cpuload[idx])
target_load = max(runnable_load_avg', cpuload'[idx])
It implies that runnable_load_avg == 0 and nr_running <= 1 in order to match the
condition. This implies that runnable_load_avg == 0 too because of the
condition: this_load <= load.
but if this _load is null, 'balanced' is already set and the test is redundant.
If sync is set, it's not as straight forward as above (especially if cgroup
are involved) but the policy should be similar as we have removed a task that's
going to sleep in order to get a more accurate load and this_load values.
The current conclusion is that these additional condition don't give any benefit
so we can remove them.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
We will have some idle load balance which are triggered during tick.
Unfortunately, the tick is also used to queue background work so we can reach
the situation where short work has been queued on a CPU which already runs a
task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
a worker thread that is pinned on a CPU so an imbalance due to pinned task is
detected and the imbalance flag is set.
Then, we will not be able to clear the flag because we have at most 1 task on
each CPU but the imbalance flag will trig to useless active load balance
between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that as a balanced state and
let the imbalance flag set.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The code in task_numa_compare() will only examine at most one idle CPU per node,
because they all have the same score. However, some idle CPUs are better
candidates than others, due to busy or idle SMT siblings, etc...
The scheduler has logic to find the best CPU within an LLC to place a
task. The NUMA code should probably use it.
This seems to reduce the standard deviation for single instance SPECjbb2005
with a low warehouse count on my 4 node test system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140904163530.189d410a@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current code can fail to migrate a waking task (silently) when TTWU_QUEUE is
enabled.
When a task is waking, it is pending on the wake_list of the rq, but it is not
queued (task->on_rq == 0). In this case, set_cpus_allowed_ptr() and
__migrate_task() will not migrate it because its invisible to them.
This behavior is incorrect, because the task has been already woken, it will be
running on the wrong CPU without correct placement until the next wake-up or
update for cpus_allowed.
To fix this problem, we need to finish the wakeup (so they appear on
the runqueue) before we migrate them.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Tested-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/538ED7EB.5050303@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>