asmadeus/linux.git - The linux kernel

Age	Commit message (Collapse)	Author
2012-07-19	ipv4: tcp: remove per net tcp_sock	Eric Dumazet
	tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19	ipv4: use seqlock for nh_exceptions	Julian Anastasov
	Use global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18	Merge branch 'master' of ↵	John W. Linville
	git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
2012-07-18	ipv6: add ipv6_addr_hash() helper	Eric Dumazet
	Introduce ipv6_addr_hash() helper doing a XOR on all bits of an IPv6 address, with an optimized x86_64 version. Use it in flow dissector, as suggested by Andrew McGregor, to reduce hash collision probabilities in fq_codel (and other users of flow dissector) Use it in ip6_tunnel.c and use more bit shuffling, as suggested by David Laight, as existing hash was ignoring most of them. Use it in sunrpc and use more bit shuffling, using hash_32(). Use it in net/ipv6/addrconf.c, using hash_32() as well. As a cleanup, use it in net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew McGregor <andrewmcgr@gmail.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18	net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.	Saurabh
	Incorporated David and Steffen's comments. Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18	ipv6: fix inet6_csk_xmit()	Eric Dumazet
	We should provide to inet6_csk_route_socket a struct flowi6 pointer, so that net6_csk_xmit() works correctly instead of sending garbage. Also add some consts Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17	Merge branch 'for-john' of ↵	John W. Linville
	git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
2012-07-17	Merge branch 'nexthop_exceptions'	David S. Miller
	These patches implement the final mechanism necessary to really allow us to go without the route cache in ipv4. We need a place to have long-term storage of PMTU/redirect information which is independent of the routes themselves, yet does not get us back into a situation where we have to write to metrics or anything like that. For this we use an "next-hop exception" table in the FIB nexthops. The one thing I desperately want to avoid is having to create clone routes in the FIB trie for this purpose, because that is very expensive. However, I'm willing to entertain such an idea later if this current scheme proves to have downsides that the FIB trie variant would not have. In order to accomodate this any such scheme, we need to be able to produce a full flow key at PMTU/redirect time. That required an adjustment of the interface call-sites used to propagate these events. For a PMTU/redirect with a fully specified socket, we pass that socket and use it to produce the flow key. Otherwise we use a passed in SKB to formulate the key. There are two cases that need to be distinguished, ICMP message processing (in which case the IP header is at skb->data) and output packet processing (mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)). We also have to make the code able to handle the case where the dst itself passed into the dst_ops->{update_pmtu,redirect} method is invalidated. This matters for calls from sockets that have cached that route. We provide a inet{,6} helper function for this purpose, and edit SCTP specially since it caches routes at the transport rather than socket level. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17	ipv4: Add FIB nexthop exceptions.	David S. Miller
	In a regime where we have subnetted route entries, we need a way to store persistent storage about destination specific learned values such as redirects and PMTU values. This is implemented here via nexthop exceptions. The initial implementation is a 2048 entry hash table with relaiming starting at chain length 5. A more sophisticated scheme can be devised if that proves necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17	net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()	David S. Miller
	This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17	cfg80211: add cellular base station regulatory hint support	Luis R. Rodriguez
	Cellular base stations can provide hints to cfg80211 about where they think we are. This can be done for example on a cell phone. To enable these hints we simply allow them through as user regulatory hints but we allow userspace to clasify the hint as either coming directly from the user or coming from a cellular base station. This option is only available when you enable CONFIG_CFG80211_CERTIFICATION_ONUS. The base station hints themselves will not be processed by the core unless at least one device on the system supports this feature. Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-17	ipvs: fix oops on NAT reply in br_nf context	Lin Ming
	IPVS should not reset skb->nf_bridge in FORWARD hook by calling nf_reset for NAT replies. It triggers oops in br_nf_forward_finish. [ 579.781508] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 [ 579.781669] IP: [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112 [ 579.781792] PGD 218f9067 PUD 0 [ 579.781865] Oops: 0000 [#1] SMP [ 579.781945] CPU 0 [ 579.781983] Modules linked in: [ 579.782047] [ 579.782080] [ 579.782114] Pid: 4644, comm: qemu Tainted: G W 3.5.0-rc5-00006-g95e69f9 #282 Hewlett-Packard /30E8 [ 579.782300] RIP: 0010:[<ffffffff817b1ca5>] [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112 [ 579.782455] RSP: 0018:ffff88007b003a98 EFLAGS: 00010287 [ 579.782541] RAX: 0000000000000008 RBX: ffff8800762ead00 RCX: 000000000001670a [ 579.782653] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff8800762ead00 [ 579.782845] RBP: ffff88007b003ac8 R08: 0000000000016630 R09: ffff88007b003a90 [ 579.782957] R10: ffff88007b0038e8 R11: ffff88002da37540 R12: ffff88002da01a02 [ 579.783066] R13: ffff88002da01a80 R14: ffff88002d83c000 R15: ffff88002d82a000 [ 579.783177] FS: 0000000000000000(0000) GS:ffff88007b000000(0063) knlGS:00000000f62d1b70 [ 579.783306] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b [ 579.783395] CR2: 0000000000000004 CR3: 00000000218fe000 CR4: 00000000000027f0 [ 579.783505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 579.783684] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 579.783795] Process qemu (pid: 4644, threadinfo ffff880021b20000, task ffff880021aba760) [ 579.783919] Stack: [ 579.783959] ffff88007693cedc ffff8800762ead00 ffff88002da01a02 ffff8800762ead00 [ 579.784110] ffff88002da01a02 ffff88002da01a80 ffff88007b003b18 ffffffff817b26c7 [ 579.784260] ffff880080000000 ffffffff81ef59f0 ffff8800762ead00 ffffffff81ef58b0 [ 579.784477] Call Trace: [ 579.784523] <IRQ> [ 579.784562] [ 579.784603] [<ffffffff817b26c7>] br_nf_forward_ip+0x275/0x2c8 [ 579.784707] [<ffffffff81704b58>] nf_iterate+0x47/0x7d [ 579.784797] [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae [ 579.784906] [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102 [ 579.784995] [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae [ 579.785175] [<ffffffff8187fa95>] ? _raw_write_unlock_bh+0x19/0x1b [ 579.785179] [<ffffffff817ac417>] __br_forward+0x97/0xa2 [ 579.785179] [<ffffffff817ad366>] br_handle_frame_finish+0x1a6/0x257 [ 579.785179] [<ffffffff817b2386>] br_nf_pre_routing_finish+0x26d/0x2cb [ 579.785179] [<ffffffff817b2cf0>] br_nf_pre_routing+0x55d/0x5c1 [ 579.785179] [<ffffffff81704b58>] nf_iterate+0x47/0x7d [ 579.785179] [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44 [ 579.785179] [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102 [ 579.785179] [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44 [ 579.785179] [<ffffffff81551525>] ? sky2_poll+0xb35/0xb54 [ 579.785179] [<ffffffff817ad62a>] br_handle_frame+0x213/0x229 [ 579.785179] [<ffffffff817ad417>] ? br_handle_frame_finish+0x257/0x257 [ 579.785179] [<ffffffff816e3b47>] __netif_receive_skb+0x2b4/0x3f1 [ 579.785179] [<ffffffff816e69fc>] process_backlog+0x99/0x1e2 [ 579.785179] [<ffffffff816e6800>] net_rx_action+0xdf/0x242 [ 579.785179] [<ffffffff8107e8a8>] __do_softirq+0xc1/0x1e0 [ 579.785179] [<ffffffff8135a5ba>] ? trace_hardirqs_off_thunk+0x3a/0x6c [ 579.785179] [<ffffffff8188812c>] call_softirq+0x1c/0x30 The steps to reproduce as follow, 1. On Host1, setup brige br0(192.168.1.106) 2. Boot a kvm guest(192.168.1.105) on Host1 and start httpd 3. Start IPVS service on Host1 ipvsadm -A -t 192.168.1.106:80 -s rr ipvsadm -a -t 192.168.1.106:80 -r 192.168.1.105:80 -m 4. Run apache benchmark on Host2(192.168.1.101) ab -n 1000 http://192.168.1.106/ ip_vs_reply4 ip_vs_out handle_response ip_vs_notrack nf_reset() { skb->nf_bridge = NULL; } Actually, IPVS wants in this case just to replace nfct with untracked version. So replace the nf_reset(skb) call in ip_vs_notrack() with a nf_conntrack_put(skb->nfct) call. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-07-17	cfg80211: support TX error rate CQM	Thomas Pedersen
	Let the user configure serveral TX error conection quality monitoring parameters: % error rate, survey interval, and # of attempted packets. On exceeding the TX failure rate over the given interval, the driver will send a CQM notify event with the actual TX failure rate and packets attempted. Signed-off-by: Thomas Pedersen <c_tpeder@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-17	tcp: implement RFC 5961 3.2	Eric Dumazet
	Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s \| grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16	net: make sock diag per-namespace	Andrey Vagin
	Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter accoding with netns in all places remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16	Merge branch 'v3.6-rc7' into tty-next	Greg Kroah-Hartman
	This is to sync up on Linus's branch to get the other tty and core changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16	sctp: Adjust PMTU updates to accomodate route invalidation.	David S. Miller
	This adjusts the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16	ipv6: Add helper inet6_csk_update_pmtu().	David S. Miller
	This is the ipv6 version of inet_csk_update_pmtu(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16	ipv4: Add helper inet_csk_update_pmtu().	David S. Miller
	This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need it's own local version of this helper. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-15	Bluetooth: Use tx window from config response for ack timing	Mat Martineau
	This change addresses an L2CAP ERTM throughput problem when a remote device does not fully utilize the available transmit window. The L2CAP ERTM transmit window size determines the maximum number of unacked frames that may be outstanding at any time. It is configured separately for each direction of an ERTM connection. Each side sends a configuration request with a tx_win field indicating how many unacked frames it is capable of receiving before sending an ack. The configuration response's tx_win field shows how many frames the transmitter will actually send before waiting for an ack. It's important to trace both the actual transmit window (to check for validity of incoming frames) and the number of frames that the transmitter will send before waiting (to send acks at the appropriate time). Now there are separate tx_win and ack_win values. ack_win is updated based on configuration responses, and is used to determine when acks are sent. Signed-off-by: Mat Martineau <mathewm@codeaurora.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2012-07-13	Merge branch 'for-davem' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John Linville says: ==================== Several drivers see updates: mwifiex, ath9k, iwlwifi, brcmsmac, wlcore/wl12xx/wl18xx, and a handful of others. The bcma bus got a lot of attention from Hauke Mehrtens. The cfg80211 component gets a flurry of patches for multi-channel support, and the mac80211 component gets the first few VHT (11ac) and 60GHz (11ad) patches. This also includes the removal of the iwmc3200 drivers, since the hardware never became available to normal people. Additionally, the NFC subsystem gets a series of updates. According to Samuel, "Here are the interesting bits: - A better error management for the HCI stack. - An LLCP "late" binding implementation for a better NFC SAP usage. SAPs are now reserved only when there's a client for it. - Support for Sony RC-S360 (a.k.a. PaSoRi) pn533 based dongle. We can read and write NFC tags and also establish a p2p link with this dongle now. - A few LLCP fixes." Finally, this includes another pull of the fixes from the wireless tree in order to resolve some merge issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-13	ipv4: Don't store a rule pointer in fib_result.	David S. Miller
	We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bits it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-13	cfg80211: reduce monitor interface tracking	Johannes Berg
	Revert commit b78e8ceac23655e1e06b30aa95ab11742d1ac7c0 ("cfg80211: track monitor channel") and remove the set_monitor_enabled() callback. Due to the tracking happening in NETDEV_PRE_UP, it had introduced bugs because the monitor interface callback would be called before the device was started. It looks like there's no way to fix this, and using NETDEV_PRE_UP is broken anyway (since there's no NETDEV_UP_FAIL), so remove all that code, track interfaces in NETDEV_UP and also stop tracking the monitor channel in cfg80211. This mostly reverts to before the tracking, except that we keep the interface count tracking so that setting the monitor channel can be rejected properly. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-13	cfg80211/mac80211: re-add get_channel operation	Johannes Berg
	This essentially reverts commit 2e165b818456 but introduces the get_channel operation with a new wireless_dev argument so that you can retrieve the channel per interface. This is necessary as even though we can track all interface channels (except monitor) we can't track the channel type used. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-12	Merge branch 'for-john' of ↵	John W. Linville
	git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Conflicts: drivers/net/wireless/iwmc3200wifi/cfg80211.c drivers/net/wireless/mwifiex/cfg80211.c
2012-07-12	Merge branch 'master' of ↵	John W. Linville
	git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2012-07-12	ipv4: Remove tb_peers from fib_table.	David S. Miller
	No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12	mac80211: add time synchronisation with BSS for assoc	Johannes Berg
	Some drivers (iwlegacy, iwlwifi and rt2x00) today use the bss_conf.last_tsf value. By itself though that value is completely worthless since it may be ancient. What really is needed is synchronisation between some device time and the TSF. To clarify this, rename bss_conf.last_tsf to sync_tsf and add sync_device_ts which is obtained from rx_status which gets a new field device_timestamp for this purpose. This is intentionally not using the mactime field since that is used for other things and in IBSS is expected to sync with the IBSS's TSF which isn't necessarily true for the device timestamp. Also, since we have the information and it's useful even before the connection has been established, give all the timing details to the driver before authenticating. Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-12	mac80211: optimize ieee80211_rx_status struct layout	Johannes Berg
	We waste a lot of space in this struct because it uses int values where smaller ones would be sufficient. The upcoming A-MPDU information needs some space, optimize the struct now. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-12	nl80211: move scan API to wdev	Johannes Berg
	The new P2P Device will have to be able to scan for P2P search, so move scanning to use struct wireless_dev instead of struct net_device. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-12	cfg80211: use wireless_dev for interface management	Johannes Berg
	In order to be able to create P2P Device wdevs, move the virtual interface management over to wireless_dev structures. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2012-07-12	ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect().	David S. Miller
	And delete rt6_redirect(), since it is no longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12	ipv6: Add redirect support to all protocol icmp error handlers.	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12	ipv6: Add ip6_redirect() and ip6_sk_redirect() helper functions.	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv6: Move bulk of redirect handling into rt6_redirect().	David S. Miller
	This sets things up so that we can have the protocol error handlers call down into the ipv6 route code for redirects just as ipv4 already does. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv6: Export ndisc option parsing from ndisc.c	David S. Miller
	This is going to be used internally by the rt6 redirect code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv4: Kill ip_rt_redirect().	David S. Miller
	No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.	David S. Miller
	Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect.	David S. Miller
	All of the redirect acceptance policy is now contained within. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	ipv4: Rearrange arguments to ip_rt_redirect()	David S. Miller
	Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	tcp: TCP Small Queues	Eric Dumazet
	This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11	Bluetooth: debug: Print l2cap_chan refcount	Andrei Emeltchenko
	Improve debug output. Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
2012-07-10	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller
	Conflicts: net/batman-adv/bridge_loop_avoidance.c net/batman-adv/bridge_loop_avoidance.h net/batman-adv/soft-interface.c net/mac80211/mlme.c With merge help from Antonio Quartulli (batman-adv) and Stephen Rothwell (drivers/net/usb/qmi_wwan.c). The net/mac80211/mlme.c conflict seemed easy enough, accounting for a conversion to some new tracing macros. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	ipv6: optimize ipv6 addresses compares	Eric Dumazet
	On 64 bit arches having efficient unaligned accesses (eg x86_64) we can use long words to reduce number of instructions for free. Joe Perches suggested to change ipv6_masked_addr_cmp() to return a bool instead of 'int', to make sure ipv6_masked_addr_cmp() cannot be used in a sorting function. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	ipv4: Remove inetpeer from routes.	David S. Miller
	No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	ipv4: Maintain redirect and PMTU info in struct rtable again.	David S. Miller
	Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	inet: Kill FLOWI_FLAG_PRECOW_METRICS.	David S. Miller
	No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	inet: Remove ->get_peer() method.	David S. Miller
	No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	tcp: Move timestamps from inetpeer to metrics cache.	David S. Miller
	With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10	net: Kill set_dst_metric_rtt().	David S. Miller
	No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>