path: root/net
Age | Commit message | Author
2010-11-08 | af_unix: optimize unix_dgram_poll() | Eric Dumazet
unix_dgram_poll() is pretty expensive when checking POLLOUT status, because it has to lock the socket to get its peer, take a reference on the peer to check its receive queue status, and queue another poll_wait on peer_wait. All of this can be avoided if the process calling unix_dgram_poll() is not interested in POLLOUT status. It also makes unix_dgram_recvmsg() faster by not queueing irrelevant pollers in peer_wait. On a test program provided by Alban Crequy: Before: real 0m0.211s user 0m0.000s sys 0m0.208s After: real 0m0.044s user 0m0.000s sys 0m0.040s Suggested-by: Davide Libenzi <davidel@xmailserver.org> Reported-by: Alban Crequy <alban.crequy@collabora.co.uk> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
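The early-exit idea described above can be sketched in user space. This is only an illustration, not the kernel code: `struct sim_sock`, its fields, and `sim_dgram_poll()` are hypothetical names, and the counter merely stands in for the lock, peer reference, and extra wait-queue entry that the real function pays for.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a connected datagram socket. */
struct sim_sock {
    int rx_pending;            /* packets waiting in our own receive queue */
    int rx_full;               /* nonzero when our receive queue is full */
    struct sim_sock *peer;     /* connected peer, or NULL */
    unsigned int peer_checks;  /* how often the expensive path was taken */
};

/* Returns the ready events; inspects the peer only if POLLOUT was requested. */
short sim_dgram_poll(struct sim_sock *sk, short requested)
{
    short mask = 0;

    assert(sk != NULL);
    if (sk->rx_pending > 0)
        mask |= POLLIN;

    /* Cheap exit: the caller does not care about writability. */
    if (!(requested & POLLOUT))
        return mask;

    /* Expensive path: this is where the peer has to be examined. */
    sk->peer_checks++;
    if (sk->peer == NULL || !sk->peer->rx_full)
        mask |= POLLOUT;
    return mask;
}
```

A caller that passes only `POLLIN` never increments `peer_checks`, which is the saving the commit message describes.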
2010-11-08 | af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event | Eric Dumazet
Alban Crequy reported a problem with connected dgram af_unix sockets and provided a test program. epoll() would fail to deliver an EPOLLOUT event when a thread dequeues a packet from the other peer, making its receive queue no longer full. This is because unix_dgram_poll() fails to call sock_poll_wait(file, &unix_sk(other)->peer_wait, wait); if the socket is not writeable at the time epoll_ctl(ADD) is called. We must call sock_poll_wait(), regardless of 'writable' status, so that epoll can be notified later of state changes. Misc: avoids testing (sk->sk_shutdown & RCV_SHUTDOWN) twice. Reported-by: Alban Crequy <alban.crequy@collabora.co.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-08 | af_unix: use keyed wakeups | Eric Dumazet
Instead of waking up all sleepers, use wake_up_interruptible_sync_poll() to wake up only the ones interested in writing to the socket. This patch is a specialization of commit 37e5540b3c9d (epoll keyed wakeups: make sockets use keyed wakeups). On a test program provided by Alban Crequy: Before: real 0m3.101s user 0m0.000s sys 0m6.104s After: real 0m0.211s user 0m0.000s sys 0m0.208s Reported-by: Alban Crequy <alban.crequy@collabora.co.uk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: David S. Miller <davem@davemloft.net>
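The effect of a keyed wakeup can be illustrated with a small user-space model. `struct sim_waiter` and `sim_wake_up_keyed()` are hypothetical names invented for this sketch; the kernel's wait-queue machinery is considerably more involved.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical waiter: the events it sleeps on and whether it was woken. */
struct sim_waiter {
    short events;
    int woken;
};

/* Wakes only the waiters whose event mask intersects `key`.
 * Returns the number of waiters woken. */
size_t sim_wake_up_keyed(struct sim_waiter *w, size_t n, short key)
{
    size_t woken = 0;

    assert(w != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (w[i].events & key) {
            w[i].woken = 1;
            woken++;
        }
    }
    return woken;
}
```

With a key of `POLLOUT`, a waiter that only asked for `POLLIN` stays asleep, which is where the reduction in system time would come from.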
2010-11-08 | decnet: RCU conversion and get rid of dev_base_lock | Eric Dumazet
While tracking dev_base_lock users, I found decnet used it in dnet_select_source(), but for a wrong purpose: Writers only hold RTNL, not dev_base_lock, so readers must use RCU if they cannot use RTNL. Adds an rcu_head in struct dn_ifaddr and handle proper RCU management. Adds __rcu annotation in dn_route as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-08 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller
2010-11-08 | rds: Fix rds message leak in rds_message_map_pages | Pavel Emelyanov
The sgs allocation error path leaks the allocated message. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-08 | pktgen: correct uninitialized queue_map | Junchang Wang
This fixes a bug reported by backyes: the first time pktgen uses queue_map, it has not yet been initialized by set_cur_queue_map(pkt_dev). Signed-off-by: Junchang Wang <junchangwang@gmail.com> Signed-off-by: Backyes <backyes@mail.ustc.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-08 | ipv6: fix overlap check for fragments | Shan Wei
The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int, and offset is int. Without this patch, the expression is evaluated as unsigned, so the overlap check gives the wrong result when (FRAG6_CB(prev)->offset + prev->len) is less than offset. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
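The conversion problem is easy to reproduce outside the kernel. In the sketch below the function names are made up, and the "fixed" variant widens to a 64-bit signed type, which is one way to avoid the wrap-around and not necessarily how the patch does it.

```c
#include <assert.h>
#include <stdbool.h>

/* Mixed int / unsigned int arithmetic: the whole expression becomes
 * unsigned, so a "negative" difference wraps to a huge positive value. */
bool overlaps_buggy(int prev_offset, unsigned int prev_len, int offset)
{
    assert(prev_offset >= 0 && offset >= 0);
    return (prev_offset + prev_len) - offset > 0;
}

/* Widen to a signed 64-bit type so the comparison cannot wrap. */
bool overlaps_fixed(int prev_offset, unsigned int prev_len, int offset)
{
    assert(prev_offset >= 0 && offset >= 0);
    long long prev_end = (long long)prev_offset + prev_len;
    return prev_end > offset;
}
```

For a previous fragment covering bytes 0..99 and a new fragment at offset 200, `overlaps_buggy()` reports an overlap while `overlaps_fixed()` does not.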
2010-11-08 | classifier: report statistics for basic classifier | stephen hemminger
The basic classifier keeps statistics but does not report it to user space. This showed up when using basic classifier (with police) as a default catch all on ingress; no statistics were reported. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-07 | NET: pktgen - fix compile warning | Dmitry Torokhov
This should fix the following warning: net/core/pktgen.c: In function ‘pktgen_if_write’: net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Reviewed-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-05 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 | Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits) inet_diag: Make sure we actually run the same bytecode we audited. netlink: Make nlmsg_find_attr take a const nlmsghdr*. fib: fib_result_assign() should not change fib refcounts netfilter: ip6_tables: fix information leak to userspace cls_cgroup: Fix crash on module unload memory corruption in X.25 facilities parsing net dst: fix percpu_counter list corruption and poison overwritten rds: Remove kfreed tcp conn from list rds: Lost locking in loop connection freeing de2104x: fix panic on load atl1 : fix panic on load netxen: remove unused firmware exports caif: Remove noisy printout when disconnecting caif socket caif: SPI-driver bugfix - incorrect padding. caif: Bugfix for socket priority, bindtodev and dbg channel. smsc911x: Set Ethernet EEPROM size to supported device's size ipv4: netfilter: ip_tables: fix information leak to userland ipv4: netfilter: arp_tables: fix information leak to userland cxgb4vf: remove call to stop TX queues at load time. cxgb4: remove call to stop TX queues at load time. ...
2010-11-04 | inet_diag: Make sure we actually run the same bytecode we audited. | Nelson Elhage
We were using nlmsg_find_attr() to look up the bytecode by attribute when auditing, but then just using the first attribute when actually running bytecode. So, if we received a message with two attribute elements, where only the second had type INET_DIAG_REQ_BYTECODE, we would validate and run different bytecode strings. Fix this by consistently using nlmsg_find_attr everywhere. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
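The mismatch between "first attribute" and "attribute found by type" can be shown with a toy attribute list. Everything here (`struct sim_attr`, the `SIM_ATTR_*` values, both helpers) is hypothetical and only mirrors the shape of the problem, not the netlink API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute types; the real ones live in the netlink headers. */
enum { SIM_ATTR_OTHER = 1, SIM_ATTR_BYTECODE = 2 };

struct sim_attr {
    int type;
    const char *payload;
};

/* What the running code effectively did: take whatever comes first. */
const struct sim_attr *sim_first_attr(const struct sim_attr *a, size_t n)
{
    assert(a != NULL || n == 0);
    return n > 0 ? &a[0] : NULL;
}

/* What the audit did: search for the attribute by its type. */
const struct sim_attr *sim_find_attr(const struct sim_attr *a, size_t n, int type)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (a[i].type == type)
            return &a[i];
    }
    return NULL;
}
```

When the bytecode attribute is not the first element, the two helpers return different entries, so using the same lookup in both places removes the gap between what is validated and what is executed.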
2010-11-04 | fib: fib_result_assign() should not change fib refcounts | Eric Dumazet
After commit ebc0ffae5 (RCU conversion of fib_lookup()), fib_result_assign() should not change fib refcounts anymore. Thanks to Michael who did the bisection and bug report. Reported-by: Michael Ellerman <michael@ellerman.id.au> Tested-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | netfilter: ip6_tables: fix information leak to userspace | Jan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 | David S. Miller
2010-11-03 | cls_cgroup: Fix crash on module unload | Herbert Xu
Somewhere along the lines net_cls_subsys_id became a macro when cls_cgroup is built as a module. Not only did it make cls_cgroup completely useless, it also causes it to crash on module unload. This patch fixes this by removing that macro. Thanks to Eric Dumazet for diagnosing this problem. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | memory corruption in X.25 facilities parsing | andrew hendry
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | net dst: fix percpu_counter list corruption and poison overwritten | Xiaotian Feng
There are some percpu_counter list corruption and poison overwritten warnings in recent kernels, caused by commit fc66f95c. Commit fc66f95c switched to percpu_counter: in ip6_route_net_init the kernel initializes the percpu_counter for dst entries, but the percpu_counter is never destroyed in ip6_route_net_exit. So if the related data is freed by the kernel, the freed percpu_counter is still on the list, and inserting or removing another percpu_counter then corrupts the list. Also, if the insert/remove operation modifies the ->prev/->next pointers of the freed value, the poison overwritten warning results. With the following patch, the percpu_counter list corruption and poison overwritten warnings disappear. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | rds: Remove kfreed tcp conn from list | Pavel Emelyanov
All the rds_tcp_connection objects are stored in a list, but an object must be removed from that list when it is freed. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | rds: Lost locking in loop connection freeing | Pavel Emelyanov
The conn is removed from list in there and this requires proper lock protection. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | caif: Remove noisy printout when disconnecting caif socket | sjur.brandeland@stericsson.com
Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | caif: Bugfix for socket priority, bindtodev and dbg channel. | André Carvalho de Matos
Changes:
o Bugfix: SO_PRIORITY for SOL_SOCKET could not be handled in caif's setsockopt, using the struct sock attribute priority instead.
o Bugfix: SO_BINDTODEVICE for SOL_SOCKET could not be handled in caif's setsockopt, using the struct sock attribute ifindex instead.
o Wrong assert statement for RFM layer segmentation.
o CAIF Debug channels were not working over SPI; caif_payload_info containing padding info must be initialized.
o Check on pointer before dereferencing when unregistering dev in caif_dev.c
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 | ipv4: netfilter: ip_tables: fix information leak to userland | Vasiliy Kulikov
Structure ipt_getinfo is copied to userland with the field "name" that has the last elements uninitialized. This leaks the contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-11-03 | ipv4: netfilter: arp_tables: fix information leak to userland | Vasiliy Kulikov
Structure arpt_getinfo is copied to userland with the field "name" that has the last elements uninitialized. This leaks the contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
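This class of leak (shared by the ip_tables and arp_tables entries) can be modelled in user space by poisoning a struct to imitate stale stack bytes. `struct sim_getinfo` and the helpers are hypothetical, and clearing the whole struct with `memset()` is shown as one common remedy, without claiming it is exactly what these patches do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_NAME_LEN 32

/* Hypothetical reply structure with a fixed-size name field. */
struct sim_getinfo {
    char name[SIM_NAME_LEN];
    unsigned int valid_hooks;
};

/* Fills the struct without clearing it; 0xAA imitates leftover stack data. */
void sim_fill_leaky(struct sim_getinfo *info, const char *name)
{
    size_t len = strlen(name);

    assert(len < SIM_NAME_LEN);
    memset(info, 0xAA, sizeof(*info));
    memcpy(info->name, name, len + 1);   /* bytes after the NUL stay stale */
    info->valid_hooks = 0;
}

/* Same, but zeroes the struct before filling it. */
void sim_fill_clean(struct sim_getinfo *info, const char *name)
{
    size_t len = strlen(name);

    assert(len < SIM_NAME_LEN);
    memset(info, 0xAA, sizeof(*info));   /* same stale bytes as above */
    memset(info, 0, sizeof(*info));      /* the defensive step */
    memcpy(info->name, name, len + 1);
    info->valid_hooks = 0;
}

/* Counts nonzero bytes after the string terminator in name[]. */
size_t sim_stale_tail_bytes(const struct sim_getinfo *info)
{
    size_t stale = 0;

    for (size_t i = strlen(info->name) + 1; i < SIM_NAME_LEN; i++) {
        if (info->name[i] != 0)
            stale++;
    }
    return stale;
}
```

Copying the leaky struct to another domain would expose every stale byte after the terminator; the cleared variant exposes none.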
2010-11-01 | net: check queue_index from sock is valid for device | Tom Herbert
In dev_pick_tx recompute the queue index if the value stored in the socket is greater than or equal to the number of real queues for the device. The saved index in the sock structure is not guaranteed to be appropriate for the egress device (this could happen on a route change or in presence of tunnelling). The result of the queue index being bad would be to return a bogus queue (a crash could presumably follow). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
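The bounds check can be sketched as follows. `sim_pick_tx()` and its parameters are hypothetical, and the modulo fallback merely stands in for whatever queue selection the kernel performs when it recomputes the index.

```c
#include <assert.h>

/* Returns a TX queue index that is valid for the device.
 * cached_index: index remembered in the socket, or -1 if none.
 * real_num_tx_queues: number of queues the egress device really has.
 * flow_hash: hypothetical hash used when the index must be recomputed. */
unsigned int sim_pick_tx(int cached_index, unsigned int real_num_tx_queues,
                         unsigned int flow_hash)
{
    assert(real_num_tx_queues > 0);

    /* Trust the cached value only if it is in range for this device. */
    if (cached_index >= 0 && (unsigned int)cached_index < real_num_tx_queues)
        return (unsigned int)cached_index;

    /* Otherwise recompute; a stale index must never select a queue. */
    return flow_hash % real_num_tx_queues;
}
```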
2010-11-01 | l2tp: kzalloc with swapped params in l2tp_dfs_seq_open | Dr. David Alan Gilbert
'sparse' spotted that the parameters to kzalloc in l2tp_dfs_seq_open were swapped. Tested on current git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git at 1792f17b7210280a3d7ff29da9614ba779cfcedb build, boots and I can see that directory, but there again I could see /sys/kernel/debug/l2tp with it swapped; I don't have any l2tp in use. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-31 | text ematch: check for NULL pointer before destroying textsearch config | Thomas Graf
While validating the configuration em_ops is already set, thus the individual destroy functions are called, but the ematch data has not been allocated and associated with the ematch yet. Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 | Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: isdn: mISDN: socket: fix information leak to userland netdev: can: Change mail address of Hans J. Koch pcnet_cs: add new_id net: Truncate recvfrom and sendto length to INT_MAX. RDS: Let rds_message_alloc_sgs() return NULL RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace RDS: Clean up error handling in rds_cmsg_rdma_args RDS: Return -EINVAL if rds_rdma_pages returns an error net: fix rds_iovec page count overflow can: pch_can: fix section mismatch warning by using a whitelisted name can: pch_can: fix sparse warning netxen_nic: Fix the tx queue manipulation bug in netxen_nic_probe ip_gre: fix fallback tunnel setup vmxnet: trivial annotation of protocol constant vmxnet3: remove unnecessary byteswapping in BAR writing macros ipv6/udp: report SndbufErrors and RcvbufErrors phy/marvell: rename 88ec048 to 88e1318s and fix mscr1 addr
2010-10-30 | net: Truncate recvfrom and sendto length to INT_MAX. | Linus Torvalds
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | RDS: Let rds_message_alloc_sgs() return NULL | Andy Grover
Even with the previous fix, we still are reading the iovecs once to determine SGs needed, and then again later on. Preallocating space for sg lists as part of rds_message seemed like a good idea but it might be better to not do this. While working to redo that code, this patch attempts to protect against userspace rewriting the rds_iovec array between the first and second accesses. The consequences of this would be either a too-small or too-large sg list array. Too large is not an issue. This patch changes all callers of message_alloc_sgs to handle running out of preallocated sgs, and fail gracefully. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace | Andy Grover
Change rds_rdma_pages to take a passed-in rds_iovec array instead of doing copy_from_user itself. Change rds_cmsg_rdma_args to copy rds_iovec array once only. This eliminates the possibility of userspace changing it after our sanity checks. Implement stack-based storage for small numbers of iovecs, based on net/socket.c, to save an alloc in the extremely common case. Although this patch reduces iovec copies in cmsg_rdma_args to 1, we still do another one in rds_rdma_extra_size. Getting rid of that one will be trickier, so it'll be a separate patch. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | RDS: Clean up error handling in rds_cmsg_rdma_args | Andy Grover
We don't need to set ret = 0 at the end -- it's initialized to 0. Also, don't increment s_send_rdma stat if we're exiting with an error. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | RDS: Return -EINVAL if rds_rdma_pages returns an error | Andy Grover
rds_cmsg_rdma_args would still return success even if rds_rdma_pages returned an error (or overflowed). Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | net: fix rds_iovec page count overflow | Linus Torvalds
As reported by Thomas Pollet, the rdma page counting can overflow. We get the rdma sizes in 64-bit unsigned entities, but then limit it to UINT_MAX bytes and shift them down to pages (so with a possible "+1" for an unaligned address). So each individual page count fits comfortably in an 'unsigned int' (not even close to overflowing into signed), but as they are added up, they might end up resulting in a signed return value. Which would be wrong. Catch the case of tot_pages turning negative, and return the appropriate error code. Reported-by: Thomas Pollet <thomas.pollet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
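A user-space sketch of a guarded page-count sum follows. Note the difference from the description above: signed overflow is undefined in portable C, so this hypothetical `sim_total_pages()` checks before each addition instead of testing for a negative total afterwards. The intent, rejecting sums that do not fit in an int, is the same.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Sums per-iovec page counts.
 * Returns the total, or -EINVAL if it would not fit in an int. */
int sim_total_pages(const unsigned int *pages, size_t n)
{
    unsigned int total = 0;   /* invariant: total <= INT_MAX */

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Each count may be fine on its own; the sum is what can overflow. */
        if (pages[i] > (unsigned int)INT_MAX - total)
            return -EINVAL;
        total += pages[i];
    }
    return (int)total;
}
```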
2010-10-30 | ip_gre: fix fallback tunnel setup | Eric Dumazet
Before making the fallback tunnel visible to lookups, we should make sure it is completely set up, i.e. ipgre_tunnel_init() has been called and the tstats per_cpu pointer allocated. Move rcu_assign_pointer(ign->tunnels_wc[0], tunnel); from ipgre_fb_tunnel_init() to ipgre_init_net(). Based on a patch from Pavel Emelyanov. Reported-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 | ipv6/udp: report SndbufErrors and RcvbufErrors | Eric Dumazet
commit a18135eb9389 (Add UDP_MIB_{SND,RCV}BUFERRORS handling.) forgot to make the necessary changes in net/ipv6/proc.c to report additional counters in /proc/net/snmp6 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-29 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 | Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits) b43: Fix warning at drivers/mmc/core/core.c:237 in mmc_wait_for_cmd mac80211: fix failure to check kmalloc return value in key_key_read libertas: Fix sd8686 firmware reload ath9k: Fix incorrect access of rate flags in RC netfilter: xt_socket: Make tproto signed in socket_mt6_v1(). stmmac: enable/disable rx/tx in the core with a single write. net: atarilance - flags should be unsigned long netxen: fix kdump pktgen: Limit how much data we copy onto the stack. net: Limit socket I/O iovec total length to INT_MAX. USB: gadget: fix ethernet gadget crash in gether_setup fib: Fix fib zone and its hash leak on namespace stop cxgb3: Fix panic in free_tx_desc() cxgb3: fix crash due to manipulating queues before registration 8390: Don't oops on starting dev queue dccp ccid-2: Stop polling dccp: Refine the wait-for-ccid mechanism dccp: Extend CCID packet dequeueing interface dccp: Return-value convention of hc_tx_send_packet() igbvf: fix panic on load ...
2010-10-29 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 | David S. Miller
2010-10-29 | mac80211: fix failure to check kmalloc return value in key_key_read | Jesper Juhl
I noticed two small issues in mac80211/debugfs_key.c::key_key_read while reading through the code. Patch below. The key_key_read() function returns ssize_t and the value that's actually returned is the return value of simple_read_from_buffer() which also returns ssize_t, so let's hold the return value in a ssize_t local variable rather than a int one. Also, memory is allocated dynamically with kmalloc() which can fail, but the return value of kmalloc() is not checked, so we may end up operating on a null pointer further on. So check for a NULL return and bail out with -ENOMEM in that case. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-10-29 | netfilter: fix nf_conntrack_l4proto_register() | Eric Dumazet
While doing __rcu annotations work on net/netfilter I found following bug. On some arches, it is possible we publish a table while its content is not yet committed to memory, and lockless reader can dereference wild pointer. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
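The publish-after-initialise ordering can be approximated in user space with C11 atomics. This is an analogy only: the kernel uses its own RCU primitives, and the release/acquire pair here is a user-space stand-in. `struct sim_proto`, `sim_table`, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical protocol descriptor that lockless readers will dereference. */
struct sim_proto {
    int l4proto;
    const char *name;
};

/* Table of published descriptors, indexed by protocol number. */
static _Atomic(struct sim_proto *) sim_table[256];

/* Publishes a fully initialised descriptor. The release store orders all
 * earlier writes to *p before the pointer becomes visible to readers. */
void sim_register(struct sim_proto *p)
{
    assert(p != NULL);
    atomic_store_explicit(&sim_table[p->l4proto & 0xff], p,
                          memory_order_release);
}

/* Lockless reader: the acquire load pairs with the release store above. */
const struct sim_proto *sim_lookup(int l4proto)
{
    return atomic_load_explicit(&sim_table[l4proto & 0xff],
                                memory_order_acquire);
}
```

A plain pointer store gives no such ordering guarantee on weakly ordered architectures, which is how a reader could see the pointer before the contents it points to.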
2010-10-29 | netfilter: nf_nat: fix compiler warning with CONFIG_NF_CT_NETLINK=n | Patrick McHardy
net/ipv4/netfilter/nf_nat_core.c:52: warning: 'nf_nat_proto_find_get' defined but not used net/ipv4/netfilter/nf_nat_core.c:66: warning: 'nf_nat_proto_put' defined but not used Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-29 | convert get_sb_pseudo() users | Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 | convert get_sb_single() users | Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-28 | netfilter: xt_socket: Make tproto signed in socket_mt6_v1(). | David S. Miller
Otherwise error indications from ipv6_find_hdr() won't be noticed. This required making the protocol argument to extract_icmp6_fields() signed too. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-28 | pktgen: Limit how much data we copy onto the stack. | Nelson Elhage
A program that accidentally writes too much data to the pktgen file can overflow the kernel stack and oops the machine. This is only triggerable by root, so there's no security issue, but it's still an unfortunate bug. printk() won't print more than 1024 bytes in a single call, anyways, so let's just never copy more than that much data. We're on a fairly shallow stack, so that should be safe even with CONFIG_4KSTACKS. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: David S. Miller <davem@davemloft.net>
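A bounded copy of this kind might look like the sketch below. The 1024-byte limit comes from the message above; `SIM_DEBUG_MAX` and `sim_debug_copy()` are hypothetical names, and the caller is assumed to provide a buffer of exactly that size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_DEBUG_MAX 1024

/* Copies at most SIM_DEBUG_MAX - 1 bytes of src into dst and terminates it.
 * Returns the number of bytes copied, regardless of how large count is. */
size_t sim_debug_copy(char dst[SIM_DEBUG_MAX], const char *src, size_t count)
{
    size_t copy = count < SIM_DEBUG_MAX - 1 ? count : SIM_DEBUG_MAX - 1;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, copy);
    dst[copy] = '\0';
    return copy;
}
```

Because the stack usage no longer depends on the caller-supplied `count`, an oversized write cannot grow the buffer past its fixed bound.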
2010-10-28 | net: Limit socket I/O iovec total length to INT_MAX. | David S. Miller
This helps protect us from overflow issues down in the individual protocol sendmsg/recvmsg handlers. Once we hit INT_MAX we truncate out the rest of the iovec by setting the iov_len members to zero. This works because: 1) For SOCK_STREAM and SOCK_SEQPACKET sockets, partial writes are allowed and the application will just continue with another write to send the rest of the data. 2) For datagram oriented sockets, where there must be a one-to-one correspondence between write() calls and packets on the wire, INT_MAX is going to be far larger than the packet size limit the protocol is going to check for and signal with -EMSGSIZE. Based upon a patch by Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>
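The truncation rule can be written as a short loop over a `struct iovec` array. `sim_clamp_iovec()` is a hypothetical user-space sketch of the behaviour described, not the kernel routine.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

/* Clamps the vector in place so that the summed iov_len never exceeds
 * INT_MAX. Entries past the limit end up with iov_len == 0.
 * Returns the resulting total length. */
int sim_clamp_iovec(struct iovec *iov, size_t n)
{
    int total = 0;

    assert(iov != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        size_t room = (size_t)(INT_MAX - total);

        if (iov[i].iov_len > room)
            iov[i].iov_len = room;   /* zero once the limit is reached */
        total += (int)iov[i].iov_len;
    }
    return total;
}
```

For lengths of INT_MAX - 10, 100, and 50, the second entry is cut to 10 and the third to 0, so the total is exactly INT_MAX.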
2010-10-28 | fib: Fix fib zone and its hash leak on namespace stop | Pavel Emelyanov
When we stop a namespace we flush the table and free one, but the added fn_zone-s (and their hashes if grown) are leaked. Need to free. Tries releases all its stuff in the flushing code. Shame on us - this bug exists since the very first make-fib-per-net patches in 2.6.27 :( Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-28 | dccp ccid-2: Stop polling | Gerrit Renker
This updates CCID-2 to use the CCID dequeuing mechanism, converting from previous continuous-polling to a now event-driven mechanism. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-28 | dccp: Refine the wait-for-ccid mechanism | Gerrit Renker
This extends the existing wait-for-ccid routine so that it may be used with different types of CCID, addressing the following problems:
1) The queue-drain mechanism only works with rate-based CCIDs. If CCID-2 for example has a full TX queue and becomes network-limited just as the application wants to close, then waiting for CCID-2 to become unblocked could lead to an indefinite delay (i.e., application "hangs").
2) Since each TX CCID in turn uses a feedback mechanism, there may be changes in its sending policy while the queue is being drained. This can lead to further delays during which the application will not be able to terminate.
3) The minimum wait time for CCID-3/4 can be expected to be the queue length times the current inter-packet delay. For example if tx_qlen=100 and a delay of 15 ms is used for each packet, then the application would have to wait for a minimum of 1.5 seconds before being allowed to exit.
4) There is no way for the user/application to control this behaviour. It would be good to use the timeout argument of dccp_close() as an upper bound. Then the maximum time that an application is willing to wait for its CCIDs can be set via the SO_LINGER option.
These problems are addressed by giving the CCID a grace period of up to the `timeout' value. The wait-for-ccid function is, as before, used when the application (a) has read all the data in its receive buffer and (b) if SO_LINGER was set with a non-zero linger time, or (c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ state (client application closes after receiving CloseReq).
In addition, there is a catch-all case of __skb_queue_purge() after waiting for the CCID. This is necessary since the write queue may still have data when (a) the host has been passively-closed, (b) abnormal termination (unread data, zero linger time), (c) wait-for-ccid could not finish within the given time limit.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-28 | dccp: Extend CCID packet dequeueing interface | Gerrit Renker
This extends the packet dequeuing interface of dccp_write_xmit() to allow
1. CCIDs to take care of timing when the next packet may be sent;
2. delayed sending (as before, with an inter-packet gap up to 65.535 seconds).
The main purpose is to take CCID-2 out of its polling mode (when it is network-limited, it tries every millisecond to send, without interruption). The mode of operation for (2) is as follows:
* new packet is enqueued via dccp_sendmsg() => dccp_write_xmit(),
* ccid_hc_tx_send_packet() detects that it may not send (e.g. window full),
* it signals this condition via `CCID_PACKET_WILL_DEQUEUE_LATER',
* dccp_write_xmit() returns without further action;
* after some time the wait-condition for CCID becomes true,
* that CCID schedules the tasklet,
* tasklet function calls ccid_hc_tx_send_packet() via dccp_write_xmit(),
* since the wait-condition is now true, ccid_hc_tx_send_packet() returns "send now",
* packet is sent, and possibly more (since dccp_write_xmit() loops).
Code reuse: the tasklet function calls dccp_write_xmit(), the timer function reduces to a wrapper around the same code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
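The event-driven flow in the bullet list can be mimicked with a small state model. All `sim_*` names are hypothetical; a simple packet window stands in for the CCID's wait-condition, and a direct function call stands in for the tasklet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts, loosely modelled on the description above. */
enum sim_ccid_verdict { SIM_SEND_NOW, SIM_WILL_DEQUEUE_LATER };

struct sim_sender {
    unsigned int queued;   /* packets waiting in the write queue */
    unsigned int window;   /* packets the CCID currently permits */
    unsigned int sent;     /* packets handed to the wire so far */
};

/* The CCID decides whether the next packet may leave now. */
enum sim_ccid_verdict sim_ccid_tx_send_packet(const struct sim_sender *s)
{
    return s->window > 0 ? SIM_SEND_NOW : SIM_WILL_DEQUEUE_LATER;
}

/* Sends while the CCID permits; returns without polling once it must wait. */
void sim_write_xmit(struct sim_sender *s)
{
    assert(s != NULL);
    while (s->queued > 0) {
        if (sim_ccid_tx_send_packet(s) == SIM_WILL_DEQUEUE_LATER)
            return;   /* the CCID will trigger a later run itself */
        s->queued--;
        s->window--;
        s->sent++;
    }
}

/* Wait-condition became true: this is what the tasklet function would do. */
void sim_ccid_window_opened(struct sim_sender *s, unsigned int credits)
{
    assert(s != NULL);
    s->window += credits;
    sim_write_xmit(s);
}
```

With five queued packets and a window of two, the first call sends two packets and returns; nothing runs again until `sim_ccid_window_opened()` is invoked, at which point the remaining three go out in one loop.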