summaryrefslogtreecommitdiffstats
path: root/net/ipv4/tcp_minisocks.c
AgeCommit message (Collapse)Author
2012-05-17tcp: bool conversionsEric Dumazet
bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02tcp: early retransmitYuchung Cheng
This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-31tcp: md5: rcu conversionEric Dumazet
In order to be able to support proper RST messages for TCP MD5 flows, we need to allow access to MD5 keys without locking listener socket. This conversion is a nice cleanup, and shrinks size of timewait sockets by 80 bytes. IPv6 code reuses generic code found in IPv4 instead of duplicating it. Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-11net: use IS_ENABLED(CONFIG_IPV6)Eric Dumazet
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30tcp: inherit listener congestion control for passive cnxEric Dumazet
Rick Jones reported that TCP_CONGESTION sockopt performed on a listener was ignored for its children sockets : right after accept() the congestion control for new socket is the system default one. This seems an oversight of the initial design (quoted from Stephen) Based on prior investigation and patch from Rick. Reported-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Yuchung Cheng <ycheng@google.com> Tested-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22net: remove ipv6_addr_copy()Alexey Dobriyan
C assignment can handle struct in6_addr copying. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08net: rename sk_clone to sk_clone_lockEric Dumazet
Make clear that sk_clone() and inet_csk_clone() return a locked socket. Add _lock() prefix and kerneldoc. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-27ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAITEric Dumazet
commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT) fixed IPv4 only. This part is for the IPv6 side, adding a tclass param to ip6_xmit() We alias tw_tclass and tw_tos, if socket family is INET6. [ if sockets is ipv4-mapped, only IP_TOS socket option is used to fill TOS field, TCLASS is not taken into account ] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2011-10-21tcp: add const qualifiers where possibleEric Dumazet
Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19tproxy: copy transparent flag when creating a time waitKOVACS Krisztian
The transparent socket option setting was not copied to the time wait socket when an inet socket was being replaced by a time wait socket. This broke the --transparent option of the socket match and may have caused that FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT state were being dropped by the packet filter. Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open sideJerry Chu
This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/ath/ath9k/ar9003_eeprom.c net/llc/af_llc.c
2010-12-08tcp: Replace time wait bucket msg by counterTom Herbert
Rather than printing the message to the log, use a mib counter to keep track of the count of occurences of time wait bucket overflow. Reduces spam in logs. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01timewait_sock: Create and use getpeer op.David S. Miller
The only thing AF-specific about remembering the timestamp for a time-wait TCP socket is getting the peer. Abstract that behind a new timewait_sock_ops vector. Support for real IPV6 sockets is not filled in yet, but curiously this makes timewait recycling start to work for v4-mapped ipv6 sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inet: Turn ->remember_stamp into ->get_peer in connection AF ops.David S. Miller
Then we can make a completely generic tcp_remember_stamp() that uses ->get_peer() as a helper, minimizing the AF specific code and minimizing the eventual code duplication when we implement the ipv6 side of TW recycling. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-23net: return operator cleanupEric Dumazet
Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12net/ipv4: EXPORT_SYMBOL cleanupsEric Dumazet
CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-11Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/stmmac/stmmac_main.c drivers/net/wireless/wl12xx/wl1271_cmd.c drivers/net/wireless/wl12xx/wl1271_main.c drivers/net/wireless/wl12xx/wl1271_spi.c net/core/ethtool.c net/mac80211/scan.c
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-21tcp: Add SNMP counter for DEFER_ACCEPTEric Dumazet
Its currently hard to diagnose when ACK frames are dropped because an application set TCP_DEFER_ACCEPT on its listening socket. See http://bugzilla.kernel.org/show_bug.cgi?id=15507 This patch adds a SNMP value, named TCPDeferAcceptDrop netstat -s | grep TCPDeferAcceptDrop TCPDeferAcceptDrop: 0 This counter is incremented every time we drop a pure ACK frame received by a socket in SYN_RECV state because its SYNACK retrans count is lower than defer_accept value. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-05net: backlog functions renameZhu Yi
sk_add_backlog -> __sk_add_backlog sk_add_backlog_limited -> sk_add_backlog Signed-off-by: Zhu Yi <yi.zhu@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-15tcp: Revert per-route SACK/DSACK/TIMESTAMP changes.David S. Miller
It creates a regression, triggering badness for SYN_RECV sockets, for example: [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293 [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000 [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32) [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000 [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000 This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand. This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits: f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d 022c3f7d82f0f1c68018696f2f027b87b9bb45c2 1aba721eba1d84a2defce45b950272cee1e6c72a cda42ebd67ee5fdf09d7057b5a4584d36fe8a335 345cda2fd695534be5a4494f1b59da9daed33663 dc343475ed062e13fc260acccaab91d7d80fd5b2 05eaade2782fb0c90d3034fd7a7d5a16266182bb 6a2a2d6bf8581216e08be15fcb563cfd6c430e1e Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1g: Responder Cookie => InitiatorWilliam Allen Simpson
Parse incoming TCP_COOKIE option(s). Calculate <SYN,ACK> TCP_COOKIE option. Send optional <SYN,ACK> data. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1f: Initiator Cookie => Responder Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1d: define TCP cookie option, extend existing struct'sWilliam Allen Simpson
Data structures are carefully composed to require minimal additions. For example, the struct tcp_options_received cookie_plus variable fits between existing 16-bit and 8-bit variables, requiring no additional space (taking alignment into consideration). There are no additions to tcp_request_sock, and only 1 pointer in tcp_sock. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principle difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. This is more flexible and less subject to user configuration error. Such a cookie option has been suggested for many years, and is also useful without SYN data, allowing several related concepts to use the same extension option. "Re: SYN floods (was: does history repeat itself?)", September 9, 1996. http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html "Re: what a new TCP header might look like", May 12, 1998. ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail These functions will also be used in subsequent patches that implement additional features. Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1a: add request_values parameter for sending SYNACKWilliam Allen Simpson
Add optional function parameters associated with sending SYNACK. These parameters are not needed after sending SYNACK, and are not used for retransmission. Avoids extending struct tcp_request_sock, and avoids allocating kernel memory. Also affects DCCP as it uses common struct request_sock_ops, but this parameter is currently reserved for future use. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-21tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTLDavid S. Miller
That's extremely non-intuitive, noticed by William Allen Simpson. And let's make the default be on, it's been suggested by a lot of people so we'll give it a try. Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13net: TCP_MSS_DEFAULT, TCP_MSS_DESIREDWilliam Allen Simpson
Define two symbols needed in both kernel and user space. Remove old (somewhat incorrect) kernel variant that wasn't used in most cases. Default should apply to both RMSS and SMSS (RFC2581). Replace numeric constants with defined symbols. Stand-alone patch, originally developed for TCPCT. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04tcp: Do not call IPv4 specific func in tcp_check_reqGilad Ben-Yossef
Calling IPv4 specific inet_csk_route_req in tcp_check_req is a bad idea and crashes machine on IPv6 connections, as reported by Valdis Kletnieks Also, all we are really interested in is the timestamp option in the header, so calling tcp_parse_options() with the "estab" set to false flag is an overkill as it tries to parse half a dozen other TCP options. We know whether timestamp should be enabled or not using data from request_sock. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Tested-by: Valdis.Kletnieks@vt.edu Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29Allow tcp_parse_options to consult dst entryGilad Ben-Yossef
We need tcp_parse_options to be aware of dst_entry to take into account per dst_entry TCP options settings Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-29Only parse time stamp TCP option in time wait sockGilad Ben-Yossef
Since we only use tcp_parse_options here to check for the exietence of TCP timestamp option in the header, it is better to call with the "established" flag on. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: Ori Finkelman <ori@comsleep.com> Signed-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19tcp: accept socket after TCP_DEFER_ACCEPT periodJulian Anastasov
Willy Tarreau and many other folks in recent years were concerned what happens when the TCP_DEFER_ACCEPT period expires for clients which sent ACK packet. They prefer clients that actively resend ACK on our SYN-ACK retransmissions to be converted from open requests to sockets and queued to the listener for accepting after the deferring period is finished. Then application server can decide to wait longer for data or to properly terminate the connection with FIN if read() returns EAGAIN which is an indication for accepting after the deferring period. This change still can have side effects for applications that expect always to see data on the accepted socket. Others can be prepared to work in both modes (with or without TCP_DEFER_ACCEPT period) and their data processing can ignore the read=EAGAIN notification and to allocate resources for clients which proved to have no data to send during the deferring period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not as a timeout) to wait for data will notice clients that didn't send data for 3 seconds but that still resend ACKs. Thanks to Willy Tarreau for the initial idea and to Eric Dumazet for the review and testing the change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-19Revert "tcp: fix tcp_defer_accept to consider the timeout"David S. Miller
This reverts commit 6d01a026b7d3009a418326bdcf313503a314f1ea. Julian Anastasov, Willy Tarreau and Eric Dumazet have come up with a more correct way to deal with this. Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13tcp: fix tcp_defer_accept to consider the timeoutWilly Tarreau
I was trying to use TCP_DEFER_ACCEPT and noticed that if the client does not talk, the connection is never accepted and remains in SYN_RECV state until the retransmits expire, where it finally is deleted. This is bad when some firewall such as netfilter sits between the client and the server because the firewall sees the connection in ESTABLISHED state while the server will finally silently drop it without sending an RST. This behaviour contradicts the man page which says it should wait only for some time : TCP_DEFER_ACCEPT (since Linux 2.4) Allows a listener to be awakened only when data arrives on the socket. Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection. This option should not be used in code intended to be portable. Also, looking at ipv4/tcp.c, a retransmit counter is correctly computed : case TCP_DEFER_ACCEPT: icsk->icsk_accept_queue.rskq_defer_accept = 0; if (val > 0) { /* Translate value in seconds to number of * retransmits */ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && val > ((TCP_TIMEOUT_INIT / HZ) << icsk->icsk_accept_queue.rskq_defer_accept)) icsk->icsk_accept_queue.rskq_defer_accept++; icsk->icsk_accept_queue.rskq_defer_accept++; } break; ==> rskq_defer_accept is used as a counter of retransmits. But in tcp_minisocks.c, this counter is only checked. And in fact, I have found no location which updates it. So I think that what was intended was to decrease it in tcp_minisocks whenever it is checked, which the trivial patch below does. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-15tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG()Robert Varga
I have recently came across a preemption imbalance detected by: <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101? <0>------------[ cut here ]------------ <2>kernel BUG at /usr/src/linux/kernel/timer.c:664! <0>invalid opcode: 0000 [1] PREEMPT SMP with ffffffff80644630 being inet_twdr_hangman(). This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a bit, so I looked at what might have caused it. One thing that struck me as strange is tcp_twsk_destructor(), as it calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well, as far as I can tell. Signed-off-by: Robert Varga <nite@hq.alert.sk> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-15tcp: fix ssthresh u16 leftoverIlpo Järvinen
It was once upon time so that snd_sthresh was a 16-bit quantity. ...That has not been true for long period of time. I run across some ancient compares which still seem to trust such legacy. Put all that magic into a single place, I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-02tcp: replace hard coded GFP_KERNEL with sk_allocationWu Fengguang
This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-29tcp: Remove redundant copy of MD5 authentication keyJohn Dykstra
Remove the copy of the MD5 authentication key from tcp_check_req(). This key has already been copied by tcp_v4_syn_recv_sock() or tcp_v6_syn_recv_sock(). Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-25tcp: missing check ACK flag of received segment in FIN-WAIT-2 stateWei Yongjun
RFC0793 defined that in FIN-WAIT-2 state if the ACK bit is off drop the segment and return[Page 72]. But this check is missing in function tcp_timewait_state_process(). This cause the segment with FIN flag but no ACK has two diffent action: Case 1: Node A Node B <------------- FIN,ACK (enter FIN-WAIT-1) ACK -------------> (enter FIN-WAIT-2) FIN -------------> discard (move sk to tw list) Case 2: Node A Node B <------------- FIN,ACK (enter FIN-WAIT-1) ACK -------------> (enter FIN-WAIT-2) (move sk to tw list) FIN -------------> <------------- ACK This patch fixed the problem. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-15tcp: consolidate paws checkIlpo Järvinen
Wow, it was quite tricky to merge that stream of negations but I think I finally got it right: check & replace_ts_recent: (s32)(rcv_tsval - ts_recent) >= 0 => 0 (s32)(ts_recent - rcv_tsval) <= 0 => 0 discard: (s32)(ts_recent - rcv_tsval) > TCP_PAWS_WINDOW => 1 (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0 I toggled the return values of tcp_paws_check around since the old encoding added yet-another negation making tracking of truth-values really complicated. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-02tcp: tcp_init_wl / tcp_update_wl argument cleanupHantzis Fotis
The above functions from include/net/tcp.h have been defined with an argument that they never use. The argument is 'u32 ack' which is never used inside the function body, and thus it can be removed. The rest of the patch involves the necessary changes to the function callers of the above two functions. Signed-off-by: Hantzis Fotis <xantzis@ceid.upatras.gr> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-02tcp: kill eff_sacks "cache", the sole user can calculate itselfIlpo Järvinen
Also fixes insignificant bug that would cause sending of stale SACK block (would occur in some corner cases). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c ↵Jianjun Kong
xfrm4_policy.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07tcp: kill pointless urg_modeIlpo Järvinen
It all started from me noticing that this urgent check in tcp_clean_rtx_queue is unnecessarily inside the loop. Then I took a longer look to it and found out that the users of urg_mode can trivially do without, well almost, there was one gotcha. Bonus: those funny people who use urg with >= 2^31 write_seq - snd_una could now rejoice too (that's the only purpose for the between being there, otherwise a simple compare would have done the thing). Not that I assume that the rest of the tcp code happily lives with such mind-boggling numbers :-). Alas, it turned out to be impossible to set wmem to such numbers anyway, yes I really tried a big sendfile after setting some wmem but nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf... So I hacked a bit variable to long and found out that it seems to work... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-07tcp: (whitespace only) fix confusing indentationAdam Langley
The indentation in part of tcp_minisocks makes it look like one of the if statements is much more important than it actually is. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-06tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookupGui Jianfeng
If the following packet flow happen, kernel will panic. MathineA MathineB SYN ----------------------> SYN+ACK <---------------------- ACK(bad seq) ----------------------> When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)) is finally called by tcp_v4_reqsk_send_ack(), but the first parameter(skb->sk) is NULL at that moment, so kernel panic happens. This patch fixes this bug. OOPS output is as following: [ 302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 [ 302.817075] Oops: 0000 [#1] SMP [ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] [ 302.849946] [ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5) [ 302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0 [ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42 [ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046 [ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54 [ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000) [ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400 [ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0 [ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634 [ 302.900140] Call Trace: [ 302.902392] [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35 [ 302.907060] [<c05d28f8>] tcp_check_req+0x156/0x372 [ 302.910082] [<c04250a3>] printk+0x14/0x18 [ 302.912868] [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf [ 302.917423] [<c05d26be>] tcp_v4_rcv+0x563/0x5b9 [ 302.920453] [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183 [ 302.923865] [<c05bb10a>] ip_rcv_finish+0x286/0x2a3 [ 302.928569] [<c059e438>] dev_alloc_skb+0x11/0x25 [ 302.931563] [<c05a211f>] netif_receive_skb+0x2d6/0x33a [ 302.934914] [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32] [ 302.938735] [<c05a3b48>] net_rx_action+0x5c/0xfe [ 302.941792] [<c042856b>] __do_softirq+0x5d/0xc1 [ 302.944788] [<c042850e>] __do_softirq+0x0/0xc1 [ 302.948999] [<c040564b>] do_softirq+0x55/0x88 [ 302.951870] [<c04501b1>] handle_fasteoi_irq+0x0/0xa4 [ 302.954986] [<c04284da>] irq_exit+0x35/0x69 [ 302.959081] [<c0405717>] do_IRQ+0x99/0xae [ 302.961896] [<c040422b>] common_interrupt+0x23/0x28 [ 302.966279] [<c040819d>] default_idle+0x2a/0x3d [ 302.969212] [<c0402552>] cpu_idle+0xb2/0xd2 [ 302.972169] ======================= [ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6 [ 303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54 [ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to NET_INC_STATS_BHPavel Emelyanov
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16mib: add net to TCP_INC_STATS_BHPavel Emelyanov
Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-13Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smc911x.c
2008-06-12tcp: Revert 'process defer accept as established' changes.David S. Miller
This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 ("tcp: Fix slab corruption with ipv6 and tcp6fuzz"). This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck. Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering. Next is a problem noticed by Vitaliy Gusev, he noted: ---------------------------------------- >--- a/net/ipv4/tcp_timer.c >+++ b/net/ipv4/tcp_timer.c >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data) > goto death; > } > >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) { >+ tcp_send_active_reset(sk, GFP_ATOMIC); >+ goto death; Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing. ---------------------------------------- Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs: ---------------------------------------- Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control. ---------------------------------------- So revert this thing for now. Signed-off-by: David S. Miller <davem@davemloft.net>