summaryrefslogtreecommitdiffstats
path: root/net/ipv4/af_inet.c
AgeCommit message (Collapse)Author
2012-07-19net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)Yuchung Cheng
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2) and write(2). The application should replace connect() with it to send data in the opening SYN packet. For blocking socket, sendmsg() blocks until all the data are buffered locally and the handshake is completed like connect() call. It returns similar errno like connect() if the TCP handshake fails. For non-blocking socket, it returns the number of bytes queued (and transmitted in the SYN-data packet) if cookie is available. If cookie is not available, it transmits a data-less SYN packet with Fast Open cookie request option and returns -EINPROGRESS like connect(). Using MSG_FASTOPEN on connecting or connected socket will result in simlar errno like repeating connect() calls. Therefore the application should only use this flag on new sockets. The buffer size of sendmsg() is independent of the MSS of the connection. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19net-tcp: Fast Open client - sending SYN-dataYuchung Cheng
This patch implements sending SYN-data in tcp_connect(). The data is from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch). The length of the cookie in tcp_fastopen_req, init'd to 0, controls the type of the SYN. If the cookie is not cached (len==0), the host sends data-less SYN with Fast Open cookie request option to solicit a cookie from the remote. If cookie is not available (len > 0), the host sends a SYN-data with Fast Open cookie option. If cookie length is negative, the SYN will not include any Fast Open option (for fall back operations). To deal with middleboxes that may drop SYN with data or experimental TCP option, the SYN-data is only sent once. SYN retransmits do not include data or Fast Open options. The connection will fall back to regular TCP handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-19ipv4: Early TCP socket demux.David S. Miller
Input packet processing for local sockets involves two major demuxes. One for the route and one for the socket. But we can optimize this down to one demux for certain kinds of local sockets. Currently we only do this for established TCP sockets, but it could at least in theory be expanded to other kinds of connections. If a TCP socket is established then it's identity is fully specified. This means that whatever input route was used during the three-way handshake must work equally well for the rest of the connection since the keys will not change. Once we move to established state, we cache the receive packet's input route to use later. Like the existing cached route in sk->sk_dst_cache used for output packets, we have to check for route invalidations using dst->obsolete and dst->ops->check(). Early demux occurs outside of a socket locked section, so when a route invalidation occurs we defer the fixup of sk->sk_rx_dst until we are actually inside of established state packet processing and thus have the socket locked. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-19inet: Sanitize inet{,6} protocol demux.David S. Miller
Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay are just a straight arrays. Remove all unnecessary hash masking. Document MAX_INET_PROTOS. Use RAW_HTABLE_SIZE when appropriate. Reported-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-04net: Remove casts to same typeJoe Perches
Adding casts of objects to the same type is unnecessary and confusing for a human reader. For example, this cast: int y; int *p = (int *)&y; I used the coccinelle script below to find and remove these unnecessary casts. I manually removed the conversions this script produces of casts with __force and __user. @@ type T; T *p; @@ - (T *)p + p Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21sock: Introduce named constants for sk_reusePavel Emelyanov
Name them in a "backward compatible" manner, i.e. reuse or not are still 1 and 0 respectively. The reuse value of 2 means that the socket with it will forcibly reuse everyone else's port. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15ipv4: fix checkpatch errorsDaniel Baluta
Fix checkpatch errors of the following type: * ERROR: "foo * bar" should be "foo *bar" * ERROR: "(foo*)" should be "(foo *)" Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-28Remove all #inclusions of asm/system.hDavid Howells
Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-12net: ipv4: Standardize prefixes for message loggingJoe Perches
Add #define pr_fmt(fmt) as appropriate. Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files. Standardize on "UDPLite: " for appropriate uses. Some prefixes were previously "UDPLITE: " and "UDP-Lite: ". Add KBUILD_MODNAME ": " to icmp and gre. Remove embedded prefixes as appropriate. Add missing "\n" to pr_info in gre.c. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11net: Convert printks to pr_<level>Joe Perches
Use a more current kernel messaging style. Convert a printk block to print_hex_dump. Coalesce formats, align arguments. Use %s, __func__ instead of embedding function names. Some messages that were prefixed with <foo>_close are now prefixed with <foo>_fini. Some ah4 and esp messages are now not prefixed with "ip ". The intent of this patch is to later add something like #define pr_fmt(fmt) "IPv4: " fmt. to standardize the output messages. Text size is trivially reduced. (x86-32 allyesconfig) $ size net/ipv4/built-in.o* text data bss dec hex filename 887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new 887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-13net: implement IP_RECVTOS for IP_PKTOPTIONSJiri Benc
Currently, it is not easily possible to get TOS/DSCP value of packets from an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt with IP_RECVTOS set, the same way as incoming TTL can be queried. This is not actually implemented for TOS, though. This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6 (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of the TOS field is stored from the other party's ACK. This is needed for proxies which require DSCP transparency. One such example is at http://zph.bratcheda.org/. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12per-netns ipv4 sysctl_tcp_memGlauber Costa
This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make this values per group of process somehow. This is achieved in the patches that follows in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16net: introduce and use netdev_features_t for device features setsMichał Mirosław
v2: add couple missing conversions in drivers split unexporting netdev_fix_features() implemented %pNF convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09ipv4: reduce percpu needs for icmpmsg mibsEric Dumazet
Reading /proc/net/snmp on a machine with a lot of cpus is very expensive (can be ~88000 us). This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values for all possible cpus can read 16 Mbytes of memory. ICMP messages are not considered as fast path on a typical server, and eventually few cpus handle them anyway. We can afford an atomic operation instead of using percpu data. This saves 4096 bytes per cpu and per network namespace. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19ipv4: compat_ioctl is local to af_inet.c, make it staticGerrit Renker
ipv4: compat_ioctl is local to af_inet.c, make it static Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-30net: ipv4: relax AF_INET check in bind()Eric Dumazet
commit d0733d2e29b65 (Check for mistakenly passed in non-IPv4 address) added regression on legacy apps that use bind() with AF_UNSPEC family. Relax the check, but make sure the bind() is done on INADDR_ANY addresses, as AF_UNSPEC has probably no sane meaning for other addresses. Bugzilla reference : https://bugzilla.kernel.org/show_bug.cgi?id=42012 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-and-bisected-by: Rene Meier <r_meier@freenet.de> CC: Marcus Meissner <meissner@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-05Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2011-07-04net: bind() fix error return on wrong address familyMarcus Meissner
Hi, Reinhard Max also pointed out that the error should EAFNOSUPPORT according to POSIX. The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN. Other protocols error values in their af bind() methods in current mainline git as far as a brief look shows: EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip Ciao, Marcus Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-20Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c
2011-06-17net: rfs: enable RFS before first data packet is receivedEric Dumazet
Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit : > From: Ben Hutchings <bhutchings@solarflare.com> > Date: Fri, 17 Jun 2011 00:50:46 +0100 > > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote: > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) > >> goto discard; > >> > >> if (nsk != sk) { > >> + sock_rps_save_rxhash(nsk, skb->rxhash); > >> if (tcp_child_process(sk, nsk, skb)) { > >> rsk = nsk; > >> goto reset; > >> > > > > I haven't tried this, but it looks reasonable to me. > > > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar. > > Indeed ipv6 side needs the same fix. > > Eric please add that part and resubmit. And in fact I might stick > this into net-2.6 instead of net-next-2.6 > OK, here is the net-2.6 based one then, thanks ! [PATCH v2] net: rfs: enable RFS before first data packet is received First packet received on a passive tcp flow is not correctly RFS steered. One sock_rps_record_flow() call is missing in inet_accept() But before that, we also must record rxhash when child socket is setup. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-11snmp: reduce percpu needs by 50%Eric Dumazet
SNMP mibs use two percpu arrays, one used in BH context, another in USER context. With increasing number of cpus in machines, and fact that ipv6 uses per network device ipstats_mib, this is consuming a lot of memory if many network devices are registered. commit be281e554e2a (ipv6: reduce per device ICMP mib sizes) shrinked percpu needs for ipv6, but we can reduce memory use a bit more. With recent percpu infrastructure (irqsafe_cpu_inc() ...), we no longer need this BH/USER separation since we can update counters in a single x86 instruction, regardless of the BH/USER context. Other arches than x86 might need to disable irq in their irqsafe_cpu_inc() implementation : If this happens to be a problem, we can make SNMP_ARRAY_SZ arch dependent, but a previous poll ( https://lkml.org/lkml/2011/3/17/174 ) to arch maintainers did not raise strong opposition. Only on 32bit arches, we need to disable BH for 64bit counters updates done from USER context (currently used for IP MIB) This also reduces vmlinux size : 1) x86_64 build $ size vmlinux.before vmlinux.after text data bss dec hex filename 7853650 1293772 1896448 11043870 a8841e vmlinux.before 7850578 1293772 1896448 11040798 a8781e vmlinux.after 2) i386 build $ size vmlinux.before vmlinux.afterpatch text data bss dec hex filename 6039335 635076 3670016 10344427 9dd7eb vmlinux.before 6037342 635076 3670016 10342434 9dd022 vmlinux.afterpatch Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <andi@firstfloor.org> CC: Ingo Molnar <mingo@elte.hu> CC: Tejun Heo <tj@kernel.org> CC: Christoph Lameter <cl@linux-foundation.org> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org CC: linux-arch@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-01net/ipv4: Check for mistakenly passed in non-IPv4 addressMarcus Meissner
Check against mistakenly passing in IPv6 addresses (which would result in an INADDR_ANY bind) or similar incompatible sockaddrs. Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-13net: ipv4: add IPPROTO_ICMP socket kindVasiliy Kulikov
This patch adds IPPROTO_ICMP socket kind. It makes it possible to send ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages without any special privileges. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In order not to increase the kernel's attack surface, the new functionality is disabled by default, but is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range (see below). Similar functionality is implemented in Mac OS X: http://www.manpagez.com/man/4/icmp/ A new ping socket is created with socket(PF_INET, SOCK_DGRAM, PROT_ICMP) Message identifiers (octets 4-5 of ICMP header) are interpreted as local ports. Addresses are stored in struct sockaddr_in. No port numbers are reserved for privileged processes, port 0 is reserved for API ("let the kernel pick a free number"). There is no notion of remote ports, remote port numbers provided by the user (e.g. in connect()) are ignored. Data sent and received include ICMP headers. This is deliberate to: 1) Avoid the need to transport headers values like sequence numbers by other means. 2) Make it easier to port existing programs using raw sockets. ICMP headers given to send() are checked and sanitized. The type must be ICMP_ECHO and the code must be zero (future extensions might relax this, see below). The id is set to the number (local port) of the socket, the checksum is always recomputed. ICMP reply packets received from the network are demultiplexed according to their id's, and are returned by recv() without any modifications. IP header information and ICMP errors of those packets may be obtained via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source quenches and redirects are reported as fake errors via the error queue (IP_RECVERR); the next hop address for redirects is saved to ee_info (in network order). socket(2) is restricted to the group range specified in "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning that nobody (not even root) may create ping sockets. Setting it to "100 100" would grant permissions to the single group (to either make /sbin/ping g+s and owned by this group or to grant permissions to the "netadmins" group), "0 4294967295" would enable it for the world, "100 4294967295" would enable it for the users, but not daemons. The existing code might be (in the unlikely case anyone needs it) extended rather easily to handle other similar pairs of ICMP messages (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply etc.). Userspace ping util & patch for it: http://openwall.info/wiki/people/segoon/ping For Openwall GNU/*/Linux it was the last step on the road to the setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels) is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs: http://mirrors.kernel.org/openwall/Owl/current/iso/ Initially this functionality was written by Pavel Kankovsky for Linux 2.4.32, but unfortunately it was never made public. All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with the patch. PATCH v3: - switched to flowi4. - minor changes to be consistent with raw sockets code. PATCH v2: - changed ping_debug() to pr_debug(). - removed CONFIG_IP_PING. - removed ping_seq_fops.owner field (unused for procfs). - switched to proc_net_fops_create(). - switched to %pK in seq_printf(). PATCH v1: - fixed checksumming bug. - CAP_NET_RAW may not create icmp sockets anymore. RFC v2: - minor cleanups. - introduced sysctl'able group range to restrict socket(2). Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08ipv4: Use cork flow in inet_sk_{reselect_saddr,rebuild_header}()David S. Miller
These two functions must be invoked only when the socket is locked (because socket identity modifications are made non-atomically). Therefore we can use the cork flow for output route lookups. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03ipv4: Make caller provide on-stack flow key to ip_route_output_ports().David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28ipv4: Fetch route saddr from flow key in inet_sk_reselect_saddr().David S. Miller
Now that output route lookups update the flow with source address selection, we can fetch it from fl4->saddr instead of rt->rt_src Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28inet: add RCU protection to inet->optEric Dumazet
We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We cant insert an rcu_head in struct ip_options since its included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-27ipv4: Sanitize and simplify ip_route_{connect,newports}()David S. Miller
These functions are used together as a unit for route resolution during connect(). They address the chicken-and-egg problem that exists when ports need to be allocated during connect() processing, yet such port allocations require addressing information from the routing code. It's currently more heavy handed than it needs to be, and in particular we allocate and initialize a flow object twice. Let the callers provide the on-stack flow object. That way we only need to initialize it once in the ip_route_connect() call. Later, if ip_route_newports() needs to do anything, it re-uses that flow object as-is except for the ports which it updates before the route re-lookup. Also, describe why this set of facilities are needed and how it works in a big comment. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-04-22inet: constify ip headers and in6_addrEric Dumazet
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Create and use route lookup helpers.David S. Miller
The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: Make output route lookup return rtable directly.David S. Miller
Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Kill can_sleep arg to ip_route_output_flow()David S. Miller
This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"David S. Miller
Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Can final ip_route_connect() arg to boolean "can_sleep".David S. Miller
Since that's what the current vague "flags" thing means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2011-01-30net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNTEric W. Biederman
SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1, which unfortunately means the existing infrastructure for compat networking ioctls is insufficient. A trivial compact ioctl implementation would conflict with: SIOCAX25ADDUID SIOCAIPXPRISLT SIOCGETSGCNT_IN6 SIOCGETSGCNT SIOCRSSCAUSE SIOCX25SSUBSCRIP SIOCX25SDTEFACILITIES To make this work I have updated the compat_ioctl decode path to mirror the the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl function into struct proto so I can break out ioctls by which kind of ip socket I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal ipmr_ioctl. This was necessary because unfortunately the struct layout for the SIOCGETSGCNT has unsigned longs in it so changes between 32bit and 64bit kernels. This change was sufficient to run a 32bit ip multicast routing daemon on a 64bit kernel. Reported-by: Bill Fenner <fenner@aristanetworks.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24net: change netdev->features to u32Michał Mirosław
Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17net: use the macros defined for the members of flowiChangli Gao
Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-20net: build_ehash_secret() and rt_bind_peer() cleanupsEric Dumazet
Now cmpxchg() is available on all arches, we can use it in build_ehash_secret() and rt_bind_peer() instead of using spinlocks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() ↵Changli Gao
and inet_sendpage() a new boolean flag no_autobind is added to structure proto to avoid the autobind calls when the protocol is TCP. Then sock_rps_record_flow() is called int the TCP's sendmsg() and sendpage() pathes. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_common.h | 4 ++++ include/net/sock.h | 1 + include/net/tcp.h | 8 ++++---- net/ipv4/af_inet.c | 15 +++++++++------ net/ipv4/tcp.c | 11 +++++------ net/ipv4/tcp_ipv4.c | 3 +++ net/ipv6/af_inet6.c | 8 ++++---- net/ipv6/tcp_ipv6.c | 3 +++ 8 files changed, 33 insertions(+), 20 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-30snmp: 64bit ipstats_mib for all archesEric Dumazet
/proc/net/snmp and /proc/net/netstat expose SNMP counters. Width of these counters is either 32 or 64 bits, depending on the size of "unsigned long" in kernel. This means user program parsing these files must already be prepared to deal with 64bit values, regardless of user program being 32 or 64 bit. This patch introduces 64bit snmp values for IPSTAT mib, where some counters can wrap pretty fast if they are 32bit wide. # netstat -s|egrep "InOctets|OutOctets" InOctets: 244068329096 OutOctets: 244069348848 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25snmp: add align parameter to snmp_mib_init()Eric Dumazet
In preparation for 64bit snmp counters for some mibs, add an 'align' parameter to snmp_mib_init(), instead of assuming mibs only contain 'unsigned long' fields. Callers can use __alignof__(type) to provide correct alignment. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-23net - IP_NODEFRAG option for IPv4 socketJiri Olsa
this patch is implementing IP_NODEFRAG option for IPv4 socket. The reason is, there's no other way to send out the packet with user customized header of the reassembly part. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10net-next: remove useless union keywordChangli Gao
remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15net: reserve ports for applications using fixed port numbersAmerigo Wang
(Dropped the infiniband part, because Tetsuo modified the related code, I will send a separate patch for it once this is accepted.) This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which allows users to reserve ports for third-party applications. The reserved ports will not be used by automatic port assignments (e.g. when calling connect() or bind() with port number 0). Explicit port allocation behavior is unchanged. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: Make RFS socket operations not be inet specific.David S. Miller
Idea from Eric Dumazet. As for placement inside of struct sock, I tried to choose a place that otherwise has a 32-bit hole on 64-bit systems. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-20net: Fix various endianness glitchesEric Dumazet
Sparse can help us find endianness bugs, but we need to make some cleanups to be able to more easily spot real bugs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: sk_sleep() helperEric Dumazet
Define a new function to return the waitqueue of a "struct sock". static inline wait_queue_head_t *sk_sleep(struct sock *sk) { return sk->sk_sleep; } Change all read occurrences of sk_sleep by a call to this function. Needed for a future RCU conversion. sk_sleep wont be a field directly available. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16rfs: Receive Flow SteeringTom Herbert
This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS). The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. The rxhash is passed in skb's received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table, if a valid CPU is set then the packet is steered to that CPU using the RPS mechanisms. The convolution of the simple approach is that it would potentially allow OOO packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets-- we consider this a non-starter. To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table. rps_sock_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows. rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry. Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table. And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true: - The current CPU is unset (equal to RPS_NO_CPU) - Current CPU is offline - The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery. Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to interrupting CPU s good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb, we assume that this is only called from device napi_poll which is non-reentrant. This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols. There are two configuration parameters for RFS. The "rps_flow_entries" kernel init parameter sets the number of entries in the rps_sock_flow_table, the per rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded to power of two. The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the applications processing; this can result in increased performance (higher pps, lower latency). The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. The RPC test is an request/response test similar in structure to netperf RR test ith 100 threads on each host, but does more work in userspace that netperf. e1000e on 8 core Intel No RFS or RPS 104K tps at 30% CPU No RFS (best RPS config): 290K tps at 63% CPU RFS 303K tps at 61% CPU RPC test tps CPU% 50/90/99% usec latency Latency StdDev No RFS/RPS 103K 48% 757/900/3185 4472.35 RPS only: 174K 73% 415/993/2468 491.66 RFS 223K 73% 379/651/1382 315.61 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13net: sk_dst_cache RCUificationEric Dumazet
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this work. sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock) This rwlock is readlocked for a very small amount of time, and dst entries are already freed after RCU grace period. This calls for RCU again :) This patch converts sk_dst_lock to a spinlock, and use RCU for readers. __sk_dst_get() is supposed to be called with rcu_read_lock() or if socket locked by user, so use appropriate rcu_dereference_check() condition (rcu_read_lock_held() || sock_owned_by_user(sk)) This patch avoids two atomic ops per tx packet on UDP connected sockets, for example, and permits sk_dst_lock to be much less dirtied. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>