summaryrefslogtreecommitdiffstats
path: root/include/net
AgeCommit message (Collapse)Author
2012-07-11ipv6: Move bulk of redirect handling into rt6_redirect().David S. Miller
This sets things up so that we can have the protocol error handlers call down into the ipv6 route code for redirects just as ipv4 already does. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv6: Export ndisc option parsing from ndisc.cDavid S. Miller
This is going to be used internally by the rt6 redirect code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Kill ip_rt_redirect().David S. Miller
No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect.David S. Miller
All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Rearrange arguments to ip_rt_redirect()David S. Miller
Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11tcp: TCP Small QueuesEric Dumazet
This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/batman-adv/bridge_loop_avoidance.c net/batman-adv/bridge_loop_avoidance.h net/batman-adv/soft-interface.c net/mac80211/mlme.c With merge help from Antonio Quartulli (batman-adv) and Stephen Rothwell (drivers/net/usb/qmi_wwan.c). The net/mac80211/mlme.c conflict seemed easy enough, accounting for a conversion to some new tracing macros. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv6: optimize ipv6 addresses comparesEric Dumazet
On 64 bit arches having efficient unaligned accesses (eg x86_64) we can use long words to reduce number of instructions for free. Joe Perches suggested to change ipv6_masked_addr_cmp() to return a bool instead of 'int', to make sure ipv6_masked_addr_cmp() cannot be used in a sorting function. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv4: Remove inetpeer from routes.David S. Miller
No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv4: Maintain redirect and PMTU info in struct rtable again.David S. Miller
Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10inet: Kill FLOWI_FLAG_PRECOW_METRICS.David S. Miller
No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10inet: Remove ->get_peer() method.David S. Miller
No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10tcp: Move timestamps from inetpeer to metrics cache.David S. Miller
With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10net: Kill set_dst_metric_rtt().David S. Miller
No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10tcp: Maintain dynamic metrics in local cache.David S. Miller
Maintain a local hash table of TCP dynamic metrics blobs. Computed TCP metrics are no longer maintained in the route metrics. The table uses RCU and an extremely simple hash so that it has low latency and low overhead. A simple hash is legitimate because we only make metrics blobs for fully established connections. Some tweaking of the default hash table sizes, metric timeouts, and the hash chain length limit certainly could use some tweaking. But the basic design seems sound. With help from Eric Dumazet and Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10tcp: Abstract back handling peer aliveness test into helper function.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10tcp: Move dynamnic metrics handling into seperate file.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv4: Fix crashes in fib_rules_tclass().David S. Miller
All paths assume, when CONFIG_IP_MULTIPLE_TABLES is enabled, that any successful call to fib_lookup() will initialize the fib_result->r value to something. We violated that expectation in the new fib_lookup() fast path. Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-09netfilter: nf_ct_ecache: fix crash with multiple containers, one shutting downPablo Neira Ayuso
Hans reports that he's still hitting: BUG: unable to handle kernel NULL pointer dereference at 000000000000027c IP: [<ffffffff813615db>] netlink_has_listeners+0xb/0x60 PGD 0 Oops: 0000 [#3] PREEMPT SMP CPU 0 It happens when adding a number of containers with do: nfct_query(h, NFCT_Q_CREATE, ct); and most likely one namespace shuts down. this problem was supposed to be fixed by: 70e9942 netfilter: nf_conntrack: make event callback registration per-netns Still, it was missing one rcu_access_pointer to check if the callback is set or not. Reported-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-07-07Merge branch 'master' of git://1984.lsi.us.es/nf-nextDavid S. Miller
2012-07-05ipv4: Avoid overhead when no custom FIB rules are installed.David S. Miller
If the user hasn't actually installed any custom rules, or fiddled with the default ones, don't go through the whole FIB rules layer. It's just pure overhead. Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check the individual tables by hand, one by one. Also, move fib_num_tclassid_users into the ipv4 network namespace. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2012-07-05net: Kill dst->_neighbour, accessors, and final uses.David S. Miller
No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05ipv6: Store route neighbour in rt6_info struct.David S. Miller
This makes for a simplified conversion away from dst_get_neighbour*(). All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05net: Pass neighbours and dest address into NETEVENT_REDIRECT events.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05decnet: Use neighbours privately in dn_route struct.David S. Miller
This allows an easy conversion away from dst_get_neighbour*(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05net: Add optional SKB arg to dst_ops->neigh_lookup().David S. Miller
Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05net: Do delayed neigh confirmation.David S. Miller
When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05ipv4: Make neigh lookups directly in output packet path.David S. Miller
Do not use the dst cached neigh, we'll be getting rid of that. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-04netfilter: nf_conntrack: generalize nf_ct_l4proto_netPablo Neira Ayuso
This patch generalizes nf_ct_l4proto_net by splitting it into chunks and moving the corresponding protocol part to where it really belongs to. To clarify, note that we follow two different approaches to support per-net depending if it's built-in or run-time loadable protocol tracker. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
2012-06-30sctp: be more restrictive in transport selection on bundled sacksNeil Horman
It was noticed recently that when we send data on a transport, its possible that we might bundle a sack that arrived on a different transport. While this isn't a major problem, it does go against the SHOULD requirement in section 6.4 of RFC 2960: An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK, etc.) to the same destination transport address from which it received the DATA or control chunk to which it is replying. This rule should also be followed if the endpoint is bundling DATA chunks together with the reply chunk. This patch seeks to correct that. It restricts the bundling of sack operations to only those transports which have moved the ctsn of the association forward since the last sack. By doing this we guarantee that we only bundle outbound saks on a transport that has received a chunk since the last sack. This brings us into stricter compliance with the RFC. Vlad had initially suggested that we strictly allow only sack bundling on the transport that last moved the ctsn forward. While this makes sense, I was concerned that doing so prevented us from bundling in the case where we had received chunks that moved the ctsn on multiple transports. In those cases, the RFC allows us to select any of the transports having received chunks to bundle the sack on. so I've modified the approach to allow for that, by adding a state variable to each transport that tracks weather it has moved the ctsn since the last sack. This I think keeps our behavior (and performance), close enough to our current profile that I think we can do this without a sysctl knob to enable/disable it. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yaseivch <vyasevich@gmail.com> CC: David S. Miller <davem@davemloft.net> CC: linux-sctp@vger.kernel.org Reported-by: Michele Baldessari <michele@redhat.com> Reported-by: sorin serban <sserban@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-29Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c
2012-06-29ipv4: Elide fib_validate_source() completely when possible.David S. Miller
If rpfilter is off (or the SKB has an IPSEC path) and there are not tclassid users, we don't have to do anything at all when fib_validate_source() is invoked besides setting the itag to zero. We monitor tclassid uses with a counter (modified only under RTNL and marked __read_mostly) and we protect the fib_validate_source() real work with a test against this counter and whether rpfilter is to be done. Having a way to know whether we need no tclassid processing or not also opens the door for future optimized rpfilter algorithms that do not perform full FIB lookups. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-29ipv6_tunnel: Allow receiving packets on the fallback tunnel if they pass ↵Ville Nuorvala
sanity checks At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap an extra IP header on incoming packets so they can be routed to the backend. In the v4 tunnel driver, when these packets fall on the default tunl0 device, the behavior is to decapsulate them and drop them back on the stack. So our setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real address. In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same behavior - if you didn't have an explicit tunnel setup, it would drop the packet. This patch brings that v4 feature to the v6 driver. The same IPv6 address checks are performed as with any normal tunnel, but as the fallback tunnel endpoint addresses are unspecified, the checks must be performed on a per-packet basis, rather than at tunnel configuration time. [Patch description modified by phil@ipom.com] Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com> Tested-by: Phil Dibowitz <phil@ipom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28ipv4: Adjust in_dev handling in fib_validate_source()David S. Miller
Checking for in_dev being NULL is pointless. In fact, all of our callers have in_dev precomputed already, so just pass it in and remove the NULL checking. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28net: Use NLMSG_DEFAULT_SIZE in combination with nlmsg_new()Thomas Graf
Using NLMSG_GOODSIZE results in multiple pages being used as nlmsg_new() will automatically add the size of the netlink header to the payload thus exceeding the page limit. NLMSG_DEFAULT_SIZE takes this into account. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: Jiri Pirko <jpirko@redhat.com> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Sergey Lapin <slapin@ossfans.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org> Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org> Cc: Samuel Ortiz <sameo@linux.intel.com> Reviewed-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28tcp: pass fl6 to inet6_csk_route_req()Neal Cardwell
This commit changes inet_csk_route_req() so that it uses a pointer to a struct flowi6, rather than allocating its own on the stack. This brings its behavior in line with its IPv4 cousin, inet_csk_route_req(), and allows a follow-on patch to fix a dst leak. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28ipv4: Kill rt->rt_spec_dst, no longer used.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28ipv4: Create and use fib_compute_spec_dst() helper.David S. Miller
The specific destination is the host we direct unicast replies to. Usually this is the original packet source address, but if we are responding to a multicast or broadcast packet we have to use something different. Specifically we must use the source address we would use if we were to send a packet to the unicast source of the original packet. The routing cache precomputes this value, but we want to remove that precomputation because it creates a hard dependency on the expensive rpfilter source address validation which we'd like to make cheaper. There are only three places where this matters: 1) ICMP replies. 2) pktinfo CMSG 3) IP options Now there will be no real users of rt->rt_spec_dst and we can simply remove it altogether. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28ipv4: Show that ip_send_reply() is purely unicast routine.David S. Miller
Rename it to ip_send_unicast_reply() and add explicit 'saddr' argument. This removed one of the few users of rt->rt_spec_dst. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27ipv4: Kill early demux method return value.David S. Miller
It's completely unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27xfrm_user: Propagate netlink error codes properly.David S. Miller
Instead of using a fixed value of "-1" or "-EMSGSIZE", propagate what the nla_*() interfaces actually return. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27Revert "ipv4: tcp: dont cache unconfirmed intput dst"David S. Miller
This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7. This change has several unwanted side effects: 1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll thus never create a real cached route. 2) All TCP traffic will use DST_NOCACHE and never use the routing cache at all. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27ipv4: tcp: dont cache unconfirmed intput dstEric Dumazet
DDOS synflood attacks hit badly IP route cache. On typical machines, this cache is allowed to hold up to 8 Millions dst entries, 256 bytes for each, for a total of 2GB of memory. rt_garbage_collect() triggers and tries to cleanup things. Eventually route cache is disabled but machine is under fire and might OOM and crash. This patch exploits the new TCP early demux, to set a nocache boolean in case incoming TCP frame is for a not yet ESTABLISHED or TIMEWAIT socket. This 'nocache' boolean is then used in case dst entry is not found in route cache, to create an unhashed dst entry (DST_NOCACHE) SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache output dst for syncookies), so after this patch, a machine is able to absorb a DDOS synflood attack without polluting its IP route cache. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27netfilter: nf_conntrack: add nf_ct_kfree_compat_sysctl_tableGao feng
This patch is a cleanup. It adds nf_ct_kfree_compat_sysctl_table to release l4proto's compat sysctl table and set the compat sysctl table point to NULL. This new function will be used by follow-up patches. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-06-27netfilter: nf_conntrack: prepare l4proto->init_net cleanupGao feng
l4proto->init contain quite redundant code. We can simplify this by adding a new parameter l3proto. This patch prepares that code simplification. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-06-26mac802154: add wpan device-class supportalex.bluesman.smirnov@gmail.com
Every real 802.15.4 transceiver, which works with software MAC layer, can be classified as a wpan device in this stack. So the wpan device implementation provides missing link in datapath between the device drivers and the Linux network queue. According to the IEEE 802.15.4 standard each packet can be one of the following types: - beacon - MAC layer command - ACK - data This patch adds support for the data packet-type only, but this is enough to perform data transmission and receiving over radio. Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26Merge branch 'for-john' of git://git.sipsolutions.net/mac80211-nextJohn W. Linville
2012-06-26nl80211: specify RSSI threshold in scheduled scanThomas Pedersen
Support configuring an RSSI threshold in dBm (s32) when requesting scheduled scan, below which a BSS won't be reported by the cfg80211 driver. Signed-off-by: Thomas Pedersen <c_tpeder@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>