Files
linux/include/uapi/linux
Alexei Starovoitov 6c90598174 bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
..
2014-10-20 10:30:15 +08:00
2015-10-13 17:42:34 +02:00
2014-09-23 23:15:46 -07:00
2015-04-21 16:16:02 -04:00
2016-03-01 17:15:22 -05:00
2015-10-05 03:16:46 -07:00
2016-03-08 15:28:31 -05:00
2014-07-01 10:48:05 -06:00
2014-03-20 14:55:18 +01:00
2016-03-01 16:07:29 -05:00
2015-08-25 14:41:50 -05:00
2014-11-19 14:48:02 -06:00
2016-01-20 17:09:18 -08:00
2014-08-08 15:57:31 -07:00
2015-04-12 21:25:13 -04:00
2015-12-14 12:19:37 -06:00
2015-10-26 19:06:48 -07:00
2015-05-25 13:25:35 -04:00
2014-12-02 13:52:53 -07:00
2015-02-17 14:34:51 -08:00
2016-01-12 08:21:18 -07:00
2015-08-03 17:30:00 -07:00
2014-08-08 15:57:31 -07:00
2015-07-21 10:39:05 -07:00
2015-11-23 15:06:38 -05:00
2015-01-20 13:51:06 +01:00
2014-12-13 12:42:52 -08:00
2015-05-05 13:40:44 -06:00
2015-02-02 18:46:51 -08:00
2015-08-27 19:44:53 -04:00
2015-04-23 15:16:14 -04:00
2015-07-20 14:58:46 -04:00
2016-02-24 09:04:41 +01:00
2016-01-10 22:13:15 -05:00
2015-09-04 16:54:41 -07:00
2014-03-21 14:21:13 -04:00
2014-12-10 17:41:10 -08:00
2014-12-09 14:10:41 +08:00
2015-03-05 22:08:42 -05:00
2014-04-26 12:13:24 -04:00
2015-05-24 12:49:16 -07:00
2015-06-11 23:37:37 +02:00
2015-12-18 17:48:51 -08:00
2015-10-27 18:55:31 -07:00
2015-12-21 15:28:11 -07:00
2015-03-29 12:13:52 +10:30
2014-12-09 12:05:24 +02:00
2015-03-31 13:58:35 -04:00