Commit Graph

6339 Commits

Author SHA1 Message Date
stephen hemminger
6261d983f2 ip_tunnel: embed hash list head
The IP tunnel hash heads can be embedded in the per-net structure
since it is a fixed size. Reduce the size so that the total structure
fits in a page size. The original size was overly large, even NETDEV_HASHBITS
is only 8 bits!

Also, add some white space for readability.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Pravin B Shelar <pshelar@nicira.com>.
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:47:52 -07:00
fan.du
5a139296f8 sctp: Pack dst_cookie into 1st cacheline hole for 64bit host
As dst_cookie is used in fast path sctp_transport_dst_check.

Before:
struct sctp_transport {
	struct list_head           transports;           /*     0    16 */
	atomic_t                   refcnt;               /*    16     4 */
	__u32                      dead:1;               /*    20:31  4 */
	__u32                      rto_pending:1;        /*    20:30  4 */
	__u32                      hb_sent:1;            /*    20:29  4 */
	__u32                      pmtu_pending:1;       /*    20:28  4 */

	/* XXX 28 bits hole, try to pack */

	__u32                      sack_generation;      /*    24     4 */

	/* XXX 4 bytes hole, try to pack */

	struct flowi               fl;                   /*    32    64 */
	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
	union sctp_addr            ipaddr;               /*    96    28 */

After:
struct sctp_transport {
	struct list_head           transports;           /*     0    16 */
	atomic_t                   refcnt;               /*    16     4 */
	__u32                      dead:1;               /*    20:31  4 */
	__u32                      rto_pending:1;        /*    20:30  4 */
	__u32                      hb_sent:1;            /*    20:29  4 */
	__u32                      pmtu_pending:1;       /*    20:28  4 */

	/* XXX 28 bits hole, try to pack */

	__u32                      sack_generation;      /*    24     4 */
	u32                        dst_cookie;           /*    28     4 */
	struct flowi               fl;                   /*    32    64 */
	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
	union sctp_addr            ipaddr;               /*    96    28 */

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05 12:20:51 -07:00
David S. Miller
0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Eric Dumazet
fba3679d34 fib_rules: reorder struct fib_rules fields
Move refcnt, pref, suppress_ifgroup, suppress_prefixlen out of first
cache line, as they are not used in fast path.

Make sure ctarget & fr_net are in first cache line.

(Assuming 64 bit arches and 64 bytes cache lines)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 11:53:54 -07:00
Stefan Tomanek
73f5698e77 fib_rules: fix suppressor names and default values
This change brings the suppressor attribute names into line; it also changes
the data types to provide a more consistent interface.

While -1 indicates that the suppressor is not enabled, values >= 0 for
suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
constraint.

This changes the previously presented behaviour of suppress_prefixlen, where a
prefix length _less_ than the attribute value was rejected. After this change,
a prefix length less than *or* equal to the value is considered a violation of
the rule constraint.

It also changes the default values for default and newly added rules (disabling
any suppression for those).

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 10:40:23 -07:00
Stefan Tomanek
6ef94cfafb fib_rules: add route suppression based on ifgroup
This change adds the ability to suppress a routing decision based upon the
interface group the selected interface belongs to. This allows it to
exclude specific devices from a routing decision.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:24:22 -07:00
fan.du
d27fc78208 sctp: Don't lookup dst if transport dst is still valid
When sctp sits on IPv6, sctp_transport_dst_check pass cookie as ZERO,
as a result ip6_dst_check always fail out. This behaviour makes
transport->dst useless, because every sctp_packet_transmit must look
for valid dst.

Add a dst_cookie into sctp_transport, and set the cookie whenever we
get new dst for sctp_transport. So dst validness could be checked
against it.

Since I have split genid for IPv4 and IPv6, also delete/add IPv6 address
will also bump IPv6 genid. So issues we discussed in:
http://marc.info/?l=linux-netdev&m=137404469219410&w=4
have all been sloved for this patch.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 12:36:00 -07:00
Joe Perches
574e2af7c0 include: Convert ethernet mac address declarations to use ETH_ALEN
It's convenient to have ethernet mac addresses use
ETH_ALEN to be able to grep for them a bit easier and
also to ensure that the addresses are __aligned(2).

Add #include <linux/if_ether.h> as necessary.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 12:33:54 -07:00
Cong Wang
e0d1095ae3 net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 15:11:17 -07:00
Cong Wang
dfcefb0be1 net: fix a compile error when CONFIG_NET_LL_RX_POLL is not set
When CONFIG_NET_LL_RX_POLL is not set, I got:

net/socket.c: In function ‘sock_poll’:
net/socket.c:1165:4: error: implicit declaration of function ‘sk_busy_loop’ [-Werror=implicit-function-declaration]

Fix this by adding a nop when !CONFIG_NET_LL_RX_POLL.

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 15:10:58 -07:00
Michal Kubeček
2ac3ac8f86 ipv6: prevent fib6_run_gc() contention
On a high-traffic router with many processors and many IPv6 dst
entries, soft lockup in fib6_run_gc() can occur when number of
entries reaches gc_thresh.

This happens because fib6_run_gc() uses fib6_gc_lock to allow
only one thread to run the garbage collector but ip6_dst_gc()
doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc()
returns. On a system with many entries, this can take some time
so that in the meantime, other threads pass the tests in
ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for
the lock. They then have to run the garbage collector one after
another which blocks them for quite long.

Resolve this by replacing special value ~0UL of expire parameter
to fib6_run_gc() by explicit "force" parameter to choose between
spin_lock_bh() and spin_trylock_bh() and call fib6_run_gc() with
force=false if gc_thresh is reached but not max_size.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 14:16:20 -07:00
John W. Linville
22e02a0272 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2013-08-01 14:30:59 -04:00
Joe Perches
378307217e cls_cgroup.h netprio_cgroup.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:02 -07:00
Joe Perches
4fc7074703 checksum: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:02 -07:00
Joe Perches
10dd9b7ce3 cfg80211.h/mac80211.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:02 -07:00
Joe Perches
c1d8f8041c ax25.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:02 -07:00
Joe Perches
90972b2211 arp/neighbour.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:02 -07:00
Joe Perches
cd2cf63a56 af_rxrpc.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:01 -07:00
Joe Perches
b60a8280ba af_unix.h: Remove extern from function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:01 -07:00
Joe Perches
e8e54d3c13 addrconf.h: Remove extern function prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:50:01 -07:00
Stefan Tomanek
7764a45a8f fib_rules: add .suppress operation
This change adds a new operation to the fib_rules_ops struct; it allows the
suppression of routing decisions if certain criteria are not met by its
results.

The first implemented constraint is a minimum prefix length added to the
structures of routing rules. If a rule is added with a minimum prefix length
>0, only routes meeting this threshold will be considered. Any other (more
general) routing table entries will be ignored.

When configuring a system with multiple network uplinks and default routes, it
is often convinient to reference the main routing table multiple times - but
omitting the default route. Using this patch and a modified "ip" utility, this
can be achieved by using the following command sequence:

  $ ip route add table secuplink default via 10.42.23.1

  $ ip rule add pref 100            table main prefixlength 1
  $ ip rule add pref 150 fwmark 0xA table secuplink

With this setup, packets marked 0xA will be processed by the additional routing
table "secuplink", but only if no suitable route in the main routing table can
be found. By using a minimal prefixlength of 1, the default route (/0) of the
table "main" is hidden to packets processed by rule 100; packets traveling to
destinations with more specific routing entries are processed as usual.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:27:17 -07:00
Joe Perches
5c15257f93 net: Remove extern from include/net/ scheduling prototypes
There are a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:24:22 -07:00
Joe Perches
d9d10a3096 ndisc: Add missing inline to ndisc_addr_option_pad
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 15:18:17 -07:00
Eric Dumazet
f2f872f927 netem: Introduce skb_orphan_partial() helper
Commit 547669d483 ("tcp: xps: fix reordering issues") added
unexpected reorders in case netem is used in a MQ setup for high
performance test bed.

ETH=eth0
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 32`
do
 tc qd add dev $ETH parent 1:$i netem delay 100ms
done

As all tcp packets are orphaned by netem, TCP stack believes it can
set skb->ooo_okay on all packets.

In order to allow producers to send more packets, we want to
keep sk_wmem_alloc from reaching sk_sndbuf limit.

We can do that by accounting one byte per skb in netem queues,
so that TCP stack is not fooled too much.

Tested:

With above MQ/netem setup, scaling number of concurrent flows gives
linear results and no reorders/retransmits

lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
 do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
n:1 198.46
n:10 2002.69
n:20 4000.98
n:30 6006.35
n:40 8020.93
n:50 10032.3
n:60 12081.9
n:70 13971.3
n:80 16009.7
n:90 17117.3
n:100 17425.5

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:59:49 -07:00
fan.du
ca4c3fc24e net: split rt_genid for ipv4 and ipv6
Current net name space has only one genid for both IPv4 and IPv6, it has below
drawbacks:

- Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
- Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
  entries even when the policy is only applied for one address family.

Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
separately in a fine granularity.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:56:36 -07:00
Dmitry Popov
c0155b2da4 tcp: Remove unused tcpct declarations and comments
Remove declaration, 4 defines and confusing comment that are no longer used
since 1a2c6181c4 ("tcp: Remove TCPCT").

Signed-off-by: Dmitry Popov <dp@highloadlab.com>
Acked-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 12:16:45 -07:00
Samuel Ortiz
9ea7187c53 NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD
Loading a firmware into a target is typically called firmware
download, not firmware upload. So we rename the netlink API to
NFC_CMD_FW_DOWNLOAD in order to avoid any terminology confusion from
userspace.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-07-31 01:19:43 +02:00
Andi Shyti
60ff779c4a 9p: client: remove unused code and any reference to "cancelled" function
This patch reverts commit

80b45261a0

which was implementing a 'cancelled' functionality to notify that
a cancelled request will not be replied.

This implementation was not used anywhere and therefore removed.

Signed-off-by: Andi Shyti <andi@etezian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 15:54:28 -07:00
Thomas Graf
c26bf4a513 pktgen: Add UDPCSUM flag to support UDP checksums
UDP checksums are optional, hence pktgen has been omitting them in
favour of performance. The optional flag UDPCSUM enables UDP
checksumming. If the output device supports hardware checksumming
the skb is prepared and marked CHECKSUM_PARTIAL, otherwise the
checksum is generated in software.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 22:16:36 -07:00
Asias He
82a54d0ebb VSOCK: Move af_vsock.h and vsock_addr.h to include/net
This is useful for other VSOCK transport implemented outside the
net/vmw_vsock/ directory to use these headers.

Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Andy King <acking@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 22:14:06 -07:00
Joe Stringer
024ec3deac net/sctp: Refactor SCTP skb checksum computation
This patch consolidates the SCTP checksum calculation code from various
places to a single new function, sctp_compute_cksum(skb, offset).

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 20:07:15 -07:00
Eric Dumazet
c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Eric Dumazet
64dc61306c net: add sk_stream_is_writeable() helper
Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Daniel Borkmann
91705c61b5 net: sctp: trivial: update mailing list address
The SCTP mailing list address to send patches or questions
to is linux-sctp@vger.kernel.org and not
lksctp-developers@lists.sourceforge.net anymore. Therefore,
update all occurences.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:53:38 -07:00
Yuchung Cheng
5b08e47caf tcp: prefer packet timing to TS-ECR for RTT
Prefer packet timings to TS-ecr for RTT measurements when both
sources are available. That's because broken middle-boxes and remote
peer can return packets with corrupted TS ECR fields. Similarly most
congestion controls that require RTT signals favor timing-based
sources as well. Also check for bad TS ECR values to avoid RTT
blow-ups. It has happened on production Web servers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng
375fe02c91 tcp: consolidate SYNACK RTT sampling
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Richard Cochran
cb820f8e4b net: Provide a generic socket error queue delivery method for Tx time stamps.
This patch moves the private error queue delivery function from the
af_packet code to the core socket method. In this way, network layers
only needing the error queue for transmit time stamping can share common
code.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 14:58:19 -07:00
Linus Torvalds
be9c6d9169 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Just a bunch of small fixes and tidy ups:

   1) Finish the "busy_poll" renames, from Eliezer Tamir.

   2) Fix RCU stalls in IFB driver, from Ding Tianhong.

   3) Linearize buffers properly in tun/macvtap zerocopy code.

   4) Don't crash on rmmod in vxlan, from Pravin B Shelar.

   5) Spinlock used before init in alx driver, from Maarten Lankhorst.

   6) A sparse warning fix in bnx2x broke TSO checksums, fix from Dmitry
      Kravkov.

   7) Dummy and ifb driver load failure paths can oops, fixes from Tan
      Xiaojun and Ding Tianhong.

   8) Correct MTU calculations in IP tunnels, from Alexander Duyck.

   9) Account all TCP retransmits in SNMP stats properly, from Yuchung
      Cheng.

  10) atl1e and via-rhine do not handle DMA mapping failures properly,
      from Neil Horman.

  11) Various equal-cost multipath route fixes in ipv6 from Hannes
      Frederic Sowa"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits)
  ipv6: only static routes qualify for equal cost multipathing
  via-rhine: fix dma mapping errors
  atl1e: fix dma mapping warnings
  tcp: account all retransmit failures
  usb/net/r815x: fix cast to restricted __le32
  usb/net/r8152: fix integer overflow in expression
  net: access page->private by using page_private
  net: strict_strtoul is obsolete, use kstrtoul instead
  drivers/net/ieee802154: don't use devm_pinctrl_get_select_default() in probe
  drivers/net/ethernet/cadence: don't use devm_pinctrl_get_select_default() in probe
  drivers/net/can/c_can: don't use devm_pinctrl_get_select_default() in probe
  net/usb: add relative mii functions for r815x
  net/tipc: use %*phC to dump small buffers in hex form
  qlcnic: Adding Maintainers.
  gre: Fix MTU sizing check for gretap tunnels
  pkt_sched: sch_qfq: remove forward declaration of qfq_update_agg_ts
  pkt_sched: sch_qfq: improve efficiency of make_eligible
  gso: Update tunnel segmentation to support Tx checksum offload
  inet: fix spacing in assignment
  ifb: fix oops when loading the ifb failed
  ...
2013-07-13 17:42:22 -07:00
Linus Torvalds
19d2f8e0fb Merge tag 'for-linus-3.11-merge-window-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull second round of 9p patches from Eric Van Hensbergen:
 "Several of these patches were rebased in order to correct style
  issues.  Only stylistic changes were made versus the patches which
  were in linux-next for two weeks.  The rebases have been in linux-next
  for 3 days and have passed my regressions.

  The bulk of these are RDMA fixes and improvements.  There's also some
  additions on the extended attributes front to support some additional
  namespaces and a new option for TCP to force allocation of mount
  requests from a priviledged port"

* tag 'for-linus-3.11-merge-window-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Remove the unused variable "err" in v9fs_vfs_getattr()
  9P: Add cancelled() to the transport functions.
  9P/RDMA: count posted buffers without a pending request
  9P/RDMA: Improve error handling in rdma_request
  9P/RDMA: Do not free req->rc in error handling in rdma_request()
  9P/RDMA: Use a semaphore to protect the RQ
  9P/RDMA: Protect against duplicate replies
  9P/RDMA: increase P9_RDMA_MAXSIZE to 1MB
  9pnet: refactor struct p9_fcall alloc code
  9P/RDMA: rdma_request() needs not allocate req->rc
  9P: Fix fcall allocation for rdma
  fs/9p: xattr: add trusted and security namespaces
  net/9p: add privport option to 9p tcp transport
2013-07-11 10:21:23 -07:00
Eliezer Tamir
64b0dc517e net: rename busy poll socket op and globals
Rename LL_SO to BUSY_POLL_SO
Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
Fix up users of these variables.
Fix documentation for sysctl.

a patch for the socket.7  man page will follow separately,
because of limitations of my mail setup.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir
8b80cda536 net: rename ll methods to busy-poll
Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines  in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir
076bb0c82a net: rename include/net/ll_poll.h to include/net/busy_poll.h
Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir
76b1e9b981 net/fs: change busy poll time accounting
Suggested by Linus:
Changed time accounting for busy-poll:
- Make it microsecond based.
- Use unsigned longs.
- Revert back to use time_after instead of time_in_range.
Reorder poll/select busy loop conditions:
- Clear busy_flag after one time we can't busy-poll.
- Only init busy_end if we actually are going to busy-poll.
Added one more missing need_resched() test.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-09 12:33:04 -07:00
Eliezer Tamir
cbf55001b2 net: rename low latency sockets functions to busy poll
Rename functions in include/net/ll_poll.h to busy wait.
Clarify documentation about expected power use increase.
Rename POLL_LL to POLL_BUSY_LOOP.
Add need_resched() testing to poll/select busy loops.

Note, that in select and poll can_busy_poll is dynamic and is
updated continuously to reflect the existence of supported
sockets with valid queue information.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-08 19:25:45 -07:00
Simon Derr
80b45261a0 9P: Add cancelled() to the transport functions.
RDMA needs to post a buffer for each incoming reply.
Hence it needs to keep count of these and needs to be
aware of whether a flushed request has received a reply
or not.

This patch adds the cancelled() callback to the transport modules.
It is called when RFLUSH has been received and that the corresponding
request will never receive a reply.

Signed-off-by: Simon Derr <simon.derr@bull.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-07-07 22:18:18 -05:00
Jim Garlick
2f28c8b31d net/9p: add privport option to 9p tcp transport
If the privport option is specified, the tcp transport binds local
address to a reserved port before connecting to the 9p server.

In some cases when 9P AUTH cannot be implemented, this is better than
nothing.

Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-07-07 21:59:54 -05:00
Eric Dumazet
9eb5bf838d net: sock: fix TCP_SKB_MIN_TRUESIZE
commit eea86af6b1 ("net: sock: adapt SOCK_MIN_RCVBUF and
SOCK_MIN_SNDBUF") forgot the sk_buff alignment taken into account
in __alloc_skb() : skb->truesize = SKB_TRUESIZE(size);

While above commit fixed the sender issue, the receiver is still
dropping the second packet (on loopback device), because the receiver
socket can not really hold two skbs :
First packet truesize already is above sk_rcvbuf, so even TCP coalescing
cannot help.

On a typical 64bit build, each tcp skb truesize is 2304, instead of 2272

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 16:52:10 -07:00
Daniel Borkmann
c50cd35788 net: gre: move GSO functions to gre_offload
Similarly to TCP/UDP offloading, move all related GRE functions to
gre_offload.c to make things more explicit and similar to the rest
of the code.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:37:39 -07:00
Eliezer Tamir
419076f59f net: lls fix build with allnoconfig
correct placeholder declarations to prevent build breakage when
!CONFIG_NET_LL_RX_POLL

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 23:50:47 -07:00
Eliezer Tamir
1bc2774d86 net: convert lls to use time_in_range()
Time in range will fail safely if we move to a different cpu with an
extremely large clock skew.
Add time_in_range64() and convert lls to use it.

changelog:
v2
- fixed double call to sched_clock in can_poll_ll
- fixed checkpatchisms

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 15:53:53 -07:00