summaryrefslogtreecommitdiffstats
path: root/include/net/tcp.h
AgeCommit message (Collapse)Author
2008-07-19tcp: options clean upAdam Langley
This should fix the following bugs: * Connections with MD5 signatures produce invalid packets whenever SACK options are included * MD5 signatures are counted twice in the MSS calculations Behaviour changes: * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK This is because we can't fit any SACK blocks in a packet with MD5 + TS options. There was discussion about disabling SACK rather than TS in order to fit in better with old, buggy kernels, but that was deemed to be unnecessary. * SYNs with MD5 don't include a TS option See above. Additionally, it removes a bunch of duplicated logic for calculating options, which should help avoid these sort of issues in the future. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-19tcp: Fix MD5 signatures for non-linear skbsAdam Langley
Currently, the MD5 code assumes that the SKBs are linear and, in the case that they aren't, happily goes off and hashes off the end of the SKB and into random memory. Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy Polyakov. Also includes a couple of missed route_caps from Stephen's patch in [2]. [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2 [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18mib: put tcp statistics on struct netPavel Emelyanov
Proc temporary uses stats from init_net. BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to NET_INC_STATS_BHPavel Emelyanov
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16sock: add net to prot->enter_memory_pressure callbackPavel Emelyanov
The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't have where to get the net from. I decided to add a sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to TCP_ADD_STATS_USERPavel Emelyanov
Now we're done with the TCP_XXX_STATS macros. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to TCP_DEC_STATSPavel Emelyanov
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to TCP_INC_STATS_BHPavel Emelyanov
Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to TCP_INC_STATSPavel Emelyanov
Fortunately (almost) all the TCP code has a sock to get the net from :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16tcp: add net to tcp_mib_initPavel Emelyanov
This one sets TCP MIBs after zeroing them, and thus requires the net. The existing single caller can use init_net (temporarily). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: drop unused TCP_XXX_STATS macrosPavel Emelyanov
TCP_INC_STATS_USER and TCP_ADD_STATS_BH are currently unused. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-14net: change proto destroy method to return voidBrian Haley
Change struct proto destroy function pointer to return void. Noticed by Al Viro. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-13Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smc911x.c
2008-06-12tcp: Revert 'process defer accept as established' changes.David S. Miller
This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 ("tcp: Fix slab corruption with ipv6 and tcp6fuzz"). This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck. Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering. Next is a problem noticed by Vitaliy Gusev, he noted: ---------------------------------------- >--- a/net/ipv4/tcp_timer.c >+++ b/net/ipv4/tcp_timer.c >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data) > goto death; > } > >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) { >+ tcp_send_active_reset(sk, GFP_ATOMIC); >+ goto death; Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing. ---------------------------------------- Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs: ---------------------------------------- Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control. ---------------------------------------- So revert this thing for now. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-12tcp md5sig: Let the caller pass appropriate key for ↵YOSHIFUJI Hideaki
tcp_v{4,6}_do_calc_md5_hash(). As we do for other socket/timewait-socket specific parameters, let the callers pass appropriate arguments to tcp_v{4,6}_do_calc_md5_hash(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12tcp md5sig: Share most of hash calcucaltion bits between IPv4 and IPv6.YOSHIFUJI Hideaki
We can share most part of the hash calculation code because the only difference between IPv4 and IPv6 is their pseudo headers. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12tcp md5sig: Remove redundant protocol argument.YOSHIFUJI Hideaki
Protocol is always TCP, so remove useless protocol argument. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-12tcp md5sig: Share MD5 Signature option parser between IPv4 and IPv6.YOSHIFUJI Hideaki
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-10ipv4: Remove unused declaration from include/net/tcp.h.Rami Rosen
- The tcp_unhash() method in /include/net/tcp.h is no more needed, as the unhash method in tcp_prot structure is now inet_unhash (instead of tcp_unhash in the past); see tcp_prot structure in net/ipv4/tcp_ipv4.c. - So, this patch removes tcp_unhash() declaration from include/net/tcp.h Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16[TCP]: Increase the max_burst threshold from 3 to tp->reordering.John Heffner
This change is necessary to allow cwnd to grow during persistent reordering. Cwnd moderation is applied when in the disorder state and an ack that fills the hole comes in. If the hole was greater than 3 packets, but less than tp->reordering, cwnd will shrink when it should not have. Signed-off-by: John Heffner <jheffner@napa.(none)> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/ehea/ehea_main.c drivers/net/wireless/iwlwifi/Kconfig drivers/net/wireless/rt2x00/rt61pci.c net/ipv4/inet_timewait_sock.c net/ipv6/raw.c net/mac80211/ieee80211_sta.c
2008-04-14[SKB]: __skb_append = __skb_queue_after Gerrit Renker
This expresses __skb_append in terms of __skb_queue_after, exploiting that __skb_append(old, new, list) = __skb_queue_after(list, old, new). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Remove owner from tcp_seq_afinfo.Denis V. Lunev
Move it to tcp_seq_afinfo->seq_fops as should be. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Place file operations directly into tcp_seq_afinfo.Denis V. Lunev
No need to have separate never-used variable. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Move seq_ops from tcp_iter_state to tcp_seq_afinfo.Denis V. Lunev
No need to create seq_operations for each instance of 'netstat'. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Replace struct net on tcp_iter_state with seq_net_private.Denis V. Lunev
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10[Syncookies]: Add support for TCP options via timestamps.Florian Westphal
Allow the use of SACK and window scaling when syncookies are used and the client supports tcp timestamps. Options are encoded into the timestamp sent in the syn-ack and restored from the timestamp echo when the ack is received. Based on earlier work by Glenn Griffin. This patch avoids increasing the size of structs by encoding TCP options into the least significant bits of the timestamp and by not using any 'timestamp offset'. The downside is that the timestamp sent in the packet after the synack will increase by several seconds. changes since v1: don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c and have ipv6/syncookies.c use it. Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if () Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07[TCP]: tcp_simple_retransmit can cause S+LIlpo Järvinen
This fixes Bugzilla #10384 tcp_simple_retransmit does L increment without any checking whatsoever for overflowing S+L when Reno is in use. The simplest scenario I can currently think of is rather complex in practice (there might be some more straightforward cases though). Ie., if mss is reduced during mtu probing, it may end up marking everything lost and if some duplicate ACKs arrived prior to that sacked_out will be non-zero as well, leading to S+L > packets_out, tcp_clean_rtx_queue on the next cumulative ACK or tcp_fastretrans_alert on the next duplicate ACK will fix the S counter. More straightforward (but questionable) solution would be to just call tcp_reset_reno_sack() in tcp_simple_retransmit but it would negatively impact the probe's retransmission, ie., the retransmissions would not occur if some duplicate ACKs had arrived. So I had to add reno sacked_out reseting to CA_Loss state when the first cumulative ACK arrives (this stale sacked_out might actually be the explanation for the reports of left_out overflows in kernel prior to 2.6.23 and S+L overflow reports of 2.6.24). However, this alone won't be enough to fix kernel before 2.6.24 because it is building on top of the commit 1b6d427bb7e ([TCP]: Reduce sacked_out with reno when purging write_queue) to keep the sacked_out from overflowing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-23[TCP]: Shrink syncookie_secret by 8 byte.Florian Westphal
the first u32 copied from syncookie_secret is overwritten by the minute-counter four lines below. After adjusting the destination address, the size of syncookie_secret can be reduced accordingly. AFAICS, the only other user of syncookie_secret[] is the ipv6 syncookie support. Because ipv6 syncookies only grab 44 bytes from syncookie_secret[], this shouldn't affect them in any way. With fixes from Glenn Griffin. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21[TCP]: TCP_DEFER_ACCEPT updates - process as establishedPatrick McManus
Change TCP_DEFER_ACCEPT implementation so that it transitions a connection to ESTABLISHED after handshake is complete instead of leaving it in SYN-RECV until some data arrvies. Place connection in accept queue when first data packet arrives from slow path. Benefits: - established connection is now reset if it never makes it to the accept queue - diagnostic state of established matches with the packet traces showing completed handshake - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be enforced with reasonable accuracy instead of rounding up to next exponential back-off of syn-ack retry. Signed-off-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21[NETNS][IPV6] tcp6 - make proc per namespaceDaniel Lezcano
Make the proc for tcp6 to be per namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21[NETNS][IPV4] tcp - make proc handle the network namespacesDaniel Lezcano
This patch, like udp proc, makes the proc functions to take care of which namespace the socket belongs. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04[TCP]: Add IPv6 support to TCP SYN cookiesGlenn Griffin
Updated to incorporate Eric's suggestion of using a per cpu buffer rather than allocating on the stack. Just a two line change, but will resend in it's entirety. Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-02-29[INET]: Remove struct net_proto_family* from _init calls.Denis V. Lunev
struct net_proto_family* is not used in icmp[v6]_init, ndisc_init, igmp_init and tcp_v4_init. Remove it. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Uninline tcp_is_cwnd_limitedIlpo Järvinen
net/ipv4/tcp_cong.c: tcp_reno_cong_avoid | -65 1 function changed, 65 bytes removed, diff: -65 net/ipv4/arp.c: arp_ignore | -5 1 function changed, 5 bytes removed, diff: -5 net/ipv4/tcp_bic.c: bictcp_cong_avoid | -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_cubic.c: bictcp_cong_avoid | -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_highspeed.c: hstcp_cong_avoid | -63 1 function changed, 63 bytes removed, diff: -63 net/ipv4/tcp_hybla.c: hybla_cong_avoid | -85 1 function changed, 85 bytes removed, diff: -85 net/ipv4/tcp_htcp.c: htcp_cong_avoid | -57 1 function changed, 57 bytes removed, diff: -57 net/ipv4/tcp_veno.c: tcp_veno_cong_avoid | -52 1 function changed, 52 bytes removed, diff: -52 net/ipv4/tcp_scalable.c: tcp_scalable_cong_avoid | -61 1 function changed, 61 bytes removed, diff: -61 net/ipv4/tcp_yeah.c: tcp_yeah_cong_avoid | -75 1 function changed, 75 bytes removed, diff: -75 net/ipv4/tcp_illinois.c: tcp_illinois_cong_avoid | -54 1 function changed, 54 bytes removed, diff: -54 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | -7 ccid3_hc_tx_packet_recv | +7 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0 net/ipv4/tcp_cong.c: tcp_is_cwnd_limited | +88 1 function changed, 88 bytes added, diff: +88 built-in.o: 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547 ...Again some gcc artifacts visible as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Uninline tcp_set_stateIlpo Järvinen
net/ipv4/tcp.c: tcp_close_state | -226 tcp_done | -145 tcp_close | -564 tcp_disconnect | -141 4 functions changed, 1076 bytes removed, diff: -1076 net/ipv4/tcp_input.c: tcp_fin | -86 tcp_rcv_state_process | -164 2 functions changed, 250 bytes removed, diff: -250 net/ipv4/tcp_ipv4.c: tcp_v4_connect | -209 1 function changed, 209 bytes removed, diff: -209 net/ipv4/arp.c: arp_ignore | +5 1 function changed, 5 bytes added, diff: +5 net/ipv6/tcp_ipv6.c: tcp_v6_connect | -158 1 function changed, 158 bytes removed, diff: -158 net/sunrpc/xprtsock.c: xs_sendpages | -2 1 function changed, 2 bytes removed, diff: -2 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | +7 1 function changed, 7 bytes added, diff: +7 net/ipv4/tcp.c: tcp_set_state | +238 1 function changed, 238 bytes added, diff: +238 built-in.o: 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordered of operation order causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff's results. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessaryIlpo Järvinen
The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Introduce tcp_wnd_end() to reduce line lengthsIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET] CORE: Introducing new memory accounting interface.Hideo Aoki
This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do same thing. Renaming: sk_stream_free_skb() -> sk_wmem_free_skb() __sk_stream_mem_reclaim() -> __sk_mem_reclaim() sk_stream_mem_reclaim() -> sk_mem_reclaim() sk_stream_mem_schedule -> __sk_mem_schedule() sk_stream_pages() -> sk_mem_pages() sk_stream_rmem_schedule() -> sk_rmem_schedule() sk_stream_wmem_schedule() -> sk_wmem_schedule() sk_charge_skb() -> sk_mem_charge() Removeing sk_stream_rfree(): consolidates into sock_rfree() sk_stream_set_owner_r(): consolidates into skb_set_owner_r() sk_stream_mem_schedule() The following functions are added. sk_has_account(): check if the protocol supports accounting sk_mem_uncharge(): do the opposite of sk_mem_charge() In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, present memory accounting call is renamed to new accounting call. Finally we replace present memory accounting calls with new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Convert several length variable to unsigned.YOSHIFUJI Hideaki
Several length variables cannot be negative, so convert int to unsigned int. This also allows us to do sane shift operations on those variables. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Abstract tp->highest_sack accessing & point to next skbIlpo Järvinen
Pointing to the next skb is necessary to avoid referencing already SACKed skbs which will soon be on a separate list. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Add tcp_for_write_queue_from_safe and use it in mtu_probeIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Cong.ctrl modules: remove unused good_ack from cong_avoidIlpo Järvinen
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Move FRTO checks out from write queue abstraction funcsIlpo Järvinen
Better place exists in update_send_head (other non-queue related adjustments are done there as well) which is the only caller of tcp_advance_send_head (now that the bogus call from mtu_probe is gone). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Rewrite SACK block processing & sack_recv_cache useIlpo Järvinen
Key points of this patch are: - In case new SACK information is advance only type, no skb processing below previously discovered highest point is done - Optimize cases below highest point too since there's no need to always go up to highest point (which is very likely still present in that SACK), this is not entirely true though because I'm dropping the fastpath_skb_hint which could previously optimize those cases even better. Whether that's significant, I'm not too sure. Currently it will provide skipping by walking. Combined with RB-tree, all skipping would become fast too regardless of window size (can be done incrementally later). Previously a number of cases in TCP SACK processing fails to take advantage of costly stored information in sack_recv_cache, most importantly, expected events such as cumulative ACK and new hole ACKs. Processing on such ACKs result in rather long walks building up latencies (which easily gets nasty when window is huge). Those latencies are often completely unnecessary compared with the amount of _new_ information received, usually for cumulative ACK there's no new information at all, yet TCP walks whole queue unnecessary potentially taking a number of costly cache misses on the way, etc.! Since the inclusion of highest_sack, there's a lot information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same whole the time (in all unearthly scenarios). Take advantage of this knowledge here and drop fastpath hint and use direct access to highest SACKed skb as a replacement. Effectively "special cased" fastpath is dropped. This change adds some complexity to introduce better coveraged "fastpath", though the added complexity should make TCP behave more cache friendly. The current ACK's SACK blocks are compared against each cached block individially and only ranges that are new are then scanned by the high constant walk. For other parts of write queue, even when in previously known part of the SACK blocks, a faster skip function is used (if necessary at all). In addition, whenever possible, TCP fast-forwards to highest_sack skb that was made available by an earlier patch. In typical case, no other things but this fast-forward and mandatory markings after that occur making the access pattern quite similar to the former fastpath "special case". DSACKs are special case that must always be walked. The local to recv_sack_cache copying could be more intelligent w.r.t DSACKs which are likely to be there only once but that is left to a separate patch. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Convert highest_sack to sk_buff to allow direct accessIlpo Järvinen
It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[TCP]: Splice receive support.Jens Axboe
Support for network splice receive. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19[TCP] MTUprobe: fix potential sk_send_head corruptionIlpo Järvinen
When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23[TCP]: Remove unneeded implicit type cast when calling tcp_minshall_update()Chuck Lever
The tcp_minshall_update() function is called in exactly one place, and is passed an unsigned integer for the mss_len argument. Make the sign of the argument match the sign of the passed variable in order to eliminate an unneeded implicit type cast and a mixed sign comparison in tcp_minshall_update(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[TCP]: Minor coding style fixup.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>