summaryrefslogtreecommitdiffstats
path: root/net/ipv4
AgeCommit message (Collapse)Author
2013-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking updates from David Miller: "Here is a pile of bug fixes that accumulated while I was in Europe" 1) In fixing kernel leaks to userspace during copying of socket addresses, we broke a case that used to work, namely the user providing a buffer larger than the in-kernel generic socket address structure. This broke Ruby amongst other things. Fix from Dan Carpenter. 2) Fix regression added by byte queue limit support in 8139cp driver, from Yang Yingliang. 3) The addition of MSG_SENDPAGE_NOTLAST buggered up a few sendpage implementations, they should just treat it the same as MSG_MORE. Fix from Richard Weinberger and Shawn Landden. 4) Handle icmpv4 errors received on ipv6 SIT tunnels correctly, from Oussama Ghorbel. In particular we should send an ICMPv6 unreachable in such situations. 5) Fix some regressions in the recent genetlink fixes, in particular get the pmcraid driver to use the new safer interfaces correctly. From Johannes Berg. 6) macvtap was converted to use a per-cpu set of statistics, but some code was still bumping tx_dropped elsewhere. From Jason Wang. 7) Fix build failure of xen-netback due to missing include on some architectures, from Andy Whitecroft. 8) macvtap double counts received packets in statistics, fix from Vlad Yasevich. 9) Fix various cases of using *_STATS_BH() when *_STATS() is more appropriate. From Eric Dumazet and Hannes Frederic Sowa. 10) Pktgen ipsec mode doesn't update the ipv4 header length and checksum properly after encapsulation. Fix from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) net/mlx4_en: Remove selftest TX queues empty condition {pktgen, xfrm} Update IPv4 header total len and checksum after tranformation virtio_net: make all RX paths handle erors consistently virtio_net: fix error handling for mergeable buffers virtio_net: Fixed a trivial typo (fitler --> filter) netem: fix gemodel loss generator netem: fix loss 4 state model netem: missing break in ge loss generator net/hsr: Support iproute print_opt ('ip -details ...') net/hsr: Very small fix of comment style. MAINTAINERS: Added net/hsr/ maintainer ipv6: fix possible seqlock deadlock in ip6_finish_output2 ixgbe: Make ixgbe_identify_qsfp_module_generic static ixgbe: turn NETIF_F_HW_L2FW_DOFFLOAD off by default ixgbe: ixgbe_fwd_ring_down needs to be static e1000: fix possible reset_task running after adapter down e1000: fix lockdep warning in e1000_reset_task e1000: prevent oops when adapter is being closed and reset simultaneously igb: Fixed Wake On LAN support inet: fix possible seqlock deadlocks ...
2013-11-29inet: fix possible seqlock deadlocksEric Dumazet
In commit c9e9042994d3 ("ipv4: fix possible seqlock deadlock") I left another places where IP_INC_STATS_BH() were improperly used. udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from process context, not from softirq context. This was detected by lockdep seqlock support. Reported-by: jongman heo <jongman.heo@samsung.com> Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLASTShawn Landden
Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once) added an internal flag MSG_SENDPAGE_NOTLAST, similar to MSG_MORE. algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages() and need to see the new flag as identical to MSG_MORE. This fixes sendfile() on AF_ALG. v3: also fix udp Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com> Original-patch: Richard Weinberger <richard@nod.at> Signed-off-by: Shawn Landden <shawn@churchofgit.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-28net: remove outdated comment for ipv4 and ipv6 protocol handlerBaker Zhang
since f9242b6b28d61295f2bf7e8adfb1060b382e5381 inet: Sanitize inet{,6} protocol demux. there are not pretended hash tables for ipv4 or ipv6 protocol handler. Signed-off-by: Baker Zhang <Baker.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu ↵Hannes Frederic Sowa
functions Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") conditionally updated addr_len if the msg_name is written to. The recv_error and rxpmtu functions relied on the recvmsg functions to set up addr_len before. As this does not happen any more we have to pass addr_len to those functions as well and set it to the size of the corresponding sockaddr length. This broke traceroute and such. Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") Reported-by: Brad Spengler <spender@grsecurity.net> Reported-by: Tom Labanowski Cc: mpb <mpb.mail@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23tcp_memcg: remove useless var old_limGao feng
nobody needs it. remove. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23gro: Clean up tcpX_gro_receive checksum verificationHerbert Xu
This patch simplifies the checksum verification in tcpX_gro_receive by reusing the CHECKSUM_COMPLETE code for CHECKSUM_NONE. All it does for CHECKSUM_NONE is compute the partial checksum and then treat it as if it came from the hardware (CHECKSUM_COMPLETE). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cheers, Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-23gro: Only verify TCP checksums for candidatesHerbert Xu
In some cases we may receive IP packets that are longer than their stated lengths. Such packets are never merged in GRO. However, we may end up computing their checksums incorrectly and end up allowing packets with a bogus checksum enter our stack with the checksum status set as verified. Since such packets are rare and not performance-critical, this patch simply skips the checksum verification for them. Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Thanks, Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix memory leaks and other issues in mwifiex driver, from Amitkumar Karwar. 2) skb_segment() can choke on packets using frag lists, fix from Herbert Xu with help from Eric Dumazet and others. 3) IPv4 output cached route instantiation properly handles races involving two threads trying to install the same route, but we forgot to propagate this logic to input routes as well. Fix from Alexei Starovoitov. 4) Put protections in place to make sure that recvmsg() paths never accidently copy uninitialized memory back into userspace and also make sure that we never try to use more that sockaddr_storage for building the on-kernel-stack copy of a sockaddr. Fixes from Hannes Frederic Sowa. 5) R8152 driver transmit flow bug fixes from Hayes Wang. 6) Fix some minor fallouts from genetlink changes, from Johannes Berg and Michael Opdenacker. 7) AF_PACKET sendmsg path can race with netdevice unregister notifier, fix by using RCU to make sure the network device doesn't go away from under us. Fix from Daniel Borkmann. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits) gso: handle new frag_list of frags GRO packets genetlink: fix genl_set_err() group ID genetlink: fix genlmsg_multicast() bug packet: fix use after free race in send path when dev is released xen-netback: stop the VIF thread before unbinding IRQs wimax: remove dead code net/phy: Add the autocross feature for forced links on VSC82x4 net/phy: Add VSC8662 support net/phy: Add VSC8574 support net/phy: Add VSC8234 support net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage) net: rework recvmsg handler msg_name and msg_namelen logic bridge: flush br's address entry in fdb when remove the net: core: Always propagate flag changes to interfaces ipv4: fix race in concurrent ip_route_input_slow() r8152: fix incorrect type in assignment r8152: support stopping/waking tx queue r8152: modify the tx flow r8152: fix tx/rx memory overflow netfilter: ebt_ip6: fix source and destination matching ...
2013-11-21Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter fixes for net The following patchset contains fixes for your net tree, they are: * Remove extra quote from connlimit configuration in Kconfig, from Randy Dunlap. * Fix missing mss option in syn packets sent to the backend in our new synproxy target, from Martin Topholm. * Use window scale announced by client when sending the forged syn to the backend, from Martin Topholm. * Fix IPv6 address comparison in ebtables, from Luís Fernando Cornachioni Estrozi. * Fix wrong endianess in sequence adjustment which breaks helpers in NAT configurations, from Phil Oester. * Fix the error path handling of nft_compat, from me. * Make sure the global conntrack counter is decremented after the object has been released, also from me. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds
Pull slave-dmaengine changes from Vinod Koul: "This brings for slave dmaengine: - Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as dmaengine can only transfer and not verify validaty of dma transfers - Bunch of fixes across drivers: - cppi41 driver fixes from Daniel - 8 channel freescale dma engine support and updated bindings from Hongbo - msx-dma fixes and cleanup by Markus - DMAengine updates from Dan: - Bartlomiej and Dan finalized a rework of the dma address unmap implementation. - In the course of testing 1/ a collection of enhancements to dmatest fell out. Notably basic performance statistics, and fixed / enhanced test control through new module parameters 'run', 'wait', 'noverify', and 'verbose'. Thanks to Andriy and Linus [Walleij] for their review. - Testing the raid related corner cases of 1/ triggered bugs in the recently added 16-source operation support in the ioatdma driver. - Some minor fixes / cleanups to mv_xor and ioatdma" * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits) dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers dma: mv_xor: Remove unneeded NULL address check ioat: fix ioat3_irq_reinit ioat: kill msix_single_vector support raid6test: add new corner case for ioatdma driver ioatdma: clean up sed pool kmem_cache ioatdma: fix selection of 16 vs 8 source path ioatdma: fix sed pool selection ioatdma: Fix bug in selftest after removal of DMA_MEMSET. dmatest: verbose mode dmatest: convert to dmaengine_unmap_data dmatest: add a 'wait' parameter dmatest: add basic performance metrics dmatest: add support for skipping verification and random data setup dmatest: use pseudo random numbers dmatest: support xor-only, or pq-only channels in tests dmatest: restore ability to start test at module load and init dmatest: cleanup redundant "dmatest: " prefixes dmatest: replace stored results mechanism, with uniform messages Revert "dmatest: append verify result to results" ...
2013-11-20ipv4: fix race in concurrent ip_route_input_slow()Alexei Starovoitov
CPUs can ask for local route via ip_route_input_noref() concurrently. if nh_rth_input is not cached yet, CPUs will proceed to allocate equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input via rt_cache_route() Most of the time they succeed, but on occasion the following two lines: orig = *p; prev = cmpxchg(p, orig, rt); in rt_cache_route() do race and one of the cpus fails to complete cmpxchg. But ip_route_input_slow() doesn't check the return code of rt_cache_route(), so dst is leaking. dst_destroy() is never called and 'lo' device refcnt doesn't go to zero, which can be seen in the logs as: unregister_netdevice: waiting for lo to become free. Usage count = 1 Adding mdelay() between above two lines makes it easily reproducible. Fix it similar to nh_pcpu_rth_output case. Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Mostly these are fixes for fallout due to merge window changes, as well as cures for problems that have been with us for a much longer period of time" 1) Johannes Berg noticed two major deficiencies in our genetlink registration. Some genetlink protocols we passing in constant counts for their ops array rather than something like ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were using fixed IDs for their multicast groups. We have to retain these fixed IDs to keep existing userland tools working, but reserve them so that other multicast groups used by other protocols can not possibly conflict. In dealing with these two problems, we actually now use less state management for genetlink operations and multicast groups. 2) When configuring interface hardware timestamping, fix several drivers that simply do not validate that the hwtstamp_config value is one the driver actually supports. From Ben Hutchings. 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar. 4) In dev_forward_skb(), set the skb->protocol in the right order relative to skb_scrub_packet(). From Alexei Starovoitov. 5) Bridge erroneously fails to use the proper wrapper functions to make calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki Makita. 6) When detaching a bridge port, make sure to flush all VLAN IDs to prevent them from leaking, also from Toshiaki Makita. 7) Put in a compromise for TCP Small Queues so that deep queued devices that delay TX reclaim non-trivially don't have such a performance decrease. One particularly problematic area is 802.11 AMPDU in wireless. From Eric Dumazet. 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts here. Fix from Eric Dumzaet, reported by Dave Jones. 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn. 10) When computing mergeable buffer sizes, virtio-net fails to take the virtio-net header into account. From Michael Dalton. 11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic bumping, this one has been with us for a while. From Eric Dumazet. 12) Fix NULL deref in the new TIPC fragmentation handling, from Erik Hugne. 13) 6lowpan bit used for traffic classification was wrong, from Jukka Rissanen. 14) macvlan has the same issue as normal vlans did wrt. propagating LRO disabling down to the real device, fix it the same way. From Michal Kubecek. 15) CPSW driver needs to soft reset all slaves during suspend, from Daniel Mack. 16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet. 17) The xen-netfront RX buffer refill timer isn't properly scheduled on partial RX allocation success, from Ma JieYue. 18) When ipv6 ping protocol support was added, the AF_INET6 protocol initialization cleanup path on failure was borked a little. Fix from Vlad Yasevich. 19) If a socket disconnects during a read/recvmsg/recvfrom/etc that blocks we can do the wrong thing with the msg_name we write back to userspace. From Hannes Frederic Sowa. There is another fix in the works from Hannes which will prevent future problems of this nature. 20) Fix route leak in VTI tunnel transmit, from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) genetlink: make multicast groups const, prevent abuse genetlink: pass family to functions using groups genetlink: add and use genl_set_err() genetlink: remove family pointer from genl_multicast_group genetlink: remove genl_unregister_mc_group() hsr: don't call genl_unregister_mc_group() quota/genetlink: use proper genetlink multicast APIs drop_monitor/genetlink: use proper genetlink multicast APIs genetlink: only pass array to genl_register_family_with_ops() tcp: don't update snd_nxt, when a socket is switched from repair mode atm: idt77252: fix dev refcnt leak xfrm: Release dst if this dst is improper for vti tunnel netlink: fix documentation typo in netlink_set_err() be2net: Delete secondary unicast MAC addresses during be_close be2net: Fix unconditional enabling of Rx interface options net, virtio_net: replace the magic value ping: prevent NULL pointer dereference on write to msg_name bnx2x: Prevent "timeout waiting for state X" bnx2x: prevent CFC attention bnx2x: Prevent panic during DMAE timeout ...
2013-11-19genetlink: only pass array to genl_register_family_with_ops()Johannes Berg
As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19tcp: don't update snd_nxt, when a socket is switched from repair modeAndrey Vagin
snd_nxt must be updated synchronously with sk_send_head. Otherwise tp->packets_out may be updated incorrectly, what may bring a kernel panic. Here is a kernel panic from my host. [ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150 ... [ 146.301158] Call Trace: [ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0 Before this panic a tcp socket was restored. This socket had sent and unsent data in the write queue. Sent data was restored in repair mode, then the socket was switched from reapair mode and unsent data was restored. After that the socket was switched back into repair mode. In that moment we had a socket where write queue looks like this: snd_una snd_nxt write_seq |_________|________| | sk_send_head After a second switching from repair mode the state of socket was changed: snd_una snd_nxt, write_seq |_________ ________| | sk_send_head This state is inconsistent, because snd_nxt and sk_send_head are not synchronized. Bellow you can find a call trace, how packets_out can be incremented twice for one skb, if snd_nxt and sk_send_head are not synchronized. In this case packets_out will be always positive, even when sk_write_queue is empty. tcp_write_wakeup skb = tcp_send_head(sk); tcp_fragment if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq)) tcp_adjust_pcount(sk, skb, diff); tcp_event_new_data_sent tp->packets_out += tcp_skb_pcount(skb); I think update of snd_nxt isn't required, when a socket is switched from repair mode. Because it's initialized in tcp_connect_init. Then when a write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent, so it's always is in consistent state. I have checked, that the bug is not reproduced with this patch and all tests about restoring tcp connections work fine. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19xfrm: Release dst if this dst is improper for vti tunnelfan.du
After searching rt by the vti tunnel dst/src parameter, if this rt has neither attached to any transformation nor the transformation is not tunnel oriented, this rt should be released back to ip layer. otherwise causing dst memory leakage. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-18ping: prevent NULL pointer dereference on write to msg_nameHannes Frederic Sowa
A plain read() on a socket does set msg->msg_name to NULL. So check for NULL pointer first. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-18inet: prevent leakage of uninitialized memory to user in recv syscallsHannes Frederic Sowa
Only update *addr_len when we actually fill in sockaddr, otherwise we can return uninitialized memory from the stack to the caller in the recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL) checks because we only get called with a valid addr_len pointer either from sock_common_recvmsg or inet_recvmsg. If a blocking read waits on a socket which is concurrently shut down we now return zero and set msg_msgnamelen to 0. Reported-by: mpb <mpb.mail@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-18netfilter: synproxy: send mss option to backendMartin Topholm
When the synproxy_parse_options is called on the client ack the mss option will not be present. Consequently mss wont be included in the backend syn packet, which falls back to 536 bytes mss. Therefore XT_SYNPROXY_OPT_MSS is explicitly flagged when recovering mss value from cookie. Signed-off-by: Martin Topholm <mph@one.com> Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-15seq_file: remove "%n" usage from seq_file usersTetsuo Handa
All seq_printf() users are using "%n" for calculating padding size, convert them to use seq_setwidth() / seq_pad() pair. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-14ipv4: fix possible seqlock deadlockEric Dumazet
ip4_datagram_connect() being called from process context, it should use IP_INC_STATS() instead of IP_INC_STATS_BH() otherwise we can deadlock on 32bit arches, or get corruptions of SNMP counters. Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14genetlink: make all genl_ops users constJohannes Berg
Now that genl_ops are no longer modified in place when registering, they can be made const. This patch was done mostly with spatch: @@ identifier ops; @@ +const struct genl_ops ops[] = { ... }; (except the struct thing in net/openvswitch/datapath.c) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14net-tcp: fix panic in tcp_fastopen_cache_set()Eric Dumazet
We had some reports of crashes using TCP fastopen, and Dave Jones gave a nice stack trace pointing to the error. Issue is that tcp_get_metrics() should not be called with a NULL dst Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14tcp: tsq: restore minimal amount of queueingEric Dumazet
After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14core/dev: do not ignore dmac in dev_forward_skb()Alexei Starovoitov
commit 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f230a91 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking changes from Ingo Molnar: "The biggest changes: - add lockdep support for seqcount/seqlocks structures, this unearthed both bugs and required extra annotation. - move the various kernel locking primitives to the new kernel/locking/ directory" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) block: Use u64_stats_init() to initialize seqcounts locking/lockdep: Mark __lockdep_count_forward_deps() as static lockdep/proc: Fix lock-time avg computation locking/doc: Update references to kernel/mutex.c ipv6: Fix possible ipv6 seqlock deadlock cpuset: Fix potential deadlock w/ set_mems_allowed seqcount: Add lockdep functionality to seqcount/seqlock structures net: Explicitly initialize u64_stats_sync structures for lockdep locking: Move the percpu-rwsem code to kernel/locking/ locking: Move the lglocks code to kernel/locking/ locking: Move the rwsem code to kernel/locking/ locking: Move the rtmutex code to kernel/locking/ locking: Move the semaphore core to kernel/locking/ locking: Move the spinlock code to kernel/locking/ locking: Move the lockdep code to kernel/locking/ locking: Move the mutex code to kernel/locking/ hung_task debugging: Add tracepoint to report the hang x86/locking/kconfig: Update paravirt spinlock Kconfig description lockstat: Report avg wait and hold times lockdep, x86/alternatives: Drop ancient lockdep fixup message ...
2013-11-08inet: fix a UFO regressionEric Dumazet
While testing virtio_net and skb_segment() changes, Hannes reported that UFO was sending wrong frames. It appears this was introduced by a recent commit : 8c3a897bfab1 ("inet: restore gso for vxlan") The old condition to perform IP frag was : tunnel = !!skb->encapsulation; ... if (!tunnel && proto == IPPROTO_UDP) { So the new one should be : udpfrag = !skb->encapsulation && proto == IPPROTO_UDP; ... if (udpfrag) { Initialization of udpfrag must be done before call to ops->callbacks.gso_segment(skb, features), as skb_udp_tunnel_segment() clears skb->encapsulation (We want udpfrag to be true for UFO, false for VXLAN) With help from Alexei Starovoitov Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-06net: Explicitly initialize u64_stats_sync structures for lockdepJohn Stultz
In order to enable lockdep on seqcount/seqlock structures, we must explicitly initialize any locks. The u64_stats_sync structure, uses a seqcount, and thus we need to introduce a u64_stats_init() function and use it to initialize the structure. This unfortunately adds a lot of fairly trivial initialization code to a number of drivers. But the benefit of ensuring correctness makes this worth while. Because these changes are required for lockdep to be enabled, and the changes are quite trivial, I've not yet split this patch out into 30-some separate patches, as I figured it would be better to get the various maintainers thoughts on how to best merge this change along with the seqcount lockdep enablement. Feedback would be appreciated! Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Jesse Gross <jesse@nicira.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Roger Luethi <rl@hellgate.ch> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Simon Horman <horms@verge.net.au> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Wensong Zhang <wensong@linux-vs.org> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-05ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACEHannes Frederic Sowa
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04tcp: properly handle stretch acks in slow startYuchung Cheng
Slow start now increases cwnd by 1 if an ACK acknowledges some packets, regardless the number of packets. Consequently slow start performance is highly dependent on the degree of the stretch ACKs caused by receiver or network ACK compression mechanisms (e.g., delayed-ACK, GRO, etc). But slow start algorithm is to send twice the amount of packets of packets left so it should process a stretch ACK of degree N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A follow up patch will use the remainder of the N (if greater than 1) to adjust cwnd in the congestion avoidance phase. In addition this patch retires the experimental limited slow start (LSS) feature. LSS has multiple drawbacks but questionable benefit. The fractional cwnd increase in LSS requires a loop in slow start even though it's rarely used. Configuring such an increase step via a global sysctl on different BDPS seems hard. Finally and most importantly the slow start overshoot concern is now better covered by the Hybrid slow start (hystart) enabled by default. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04tcp: enable sockets to use MSG_FASTOPEN by defaultYuchung Cheng
Applications have started to use Fast Open (e.g., Chrome browser has such an optional flag) and the feature has gone through several generations of kernels since 3.7 with many real network tests. It's time to enable this flag by default for applications to test more conveniently and extensively. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: ==================== This batch contains fives nf_tables patches for your net-next tree, they are: * Fix possible use after free in the module removal path of the x_tables compatibility layer, from Dan Carpenter. * Add filter chain type for the bridge family, from myself. * Fix Kconfig dependencies of the nf_tables bridge family with the core, from myself. * Fix sparse warnings in nft_nat, from Tomasz Bursztyka. * Remove duplicated include in the IPv4 family support for nf_tables, from Wei Yongjun. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== This is another batch containing Netfilter/IPVS updates for your net-next tree, they are: * Six patches to make the ipt_CLUSTERIP target support netnamespace, from Gao feng. * Two cleanups for the nf_conntrack_acct infrastructure, introducing a new structure to encapsulate conntrack counters, from Holger Eitzenberger. * Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann. * Skip checksum recalculation in SCTP support for IPVS, also from Daniel Borkmann. * Fix behavioural change in xt_socket after IP early demux, from Florian Westphal. * Fix bogus large memory allocation in the bitmap port set type in ipset, from Jozsef Kadlecsik. * Fix possible compilation issues in the hash netnet set type in ipset, also from Jozsef Kadlecsik. * Define constants to identify netlink callback data in ipset dumps, again from Jozsef Kadlecsik. * Use sock_gen_put() in xt_socket to replace xt_socket_put_sk, from Eric Dumazet. * Improvements for the SH scheduler in IPVS, from Alexander Frolkin. * Remove extra delay due to unneeded rcu barrier in IPVS net namespace cleanup path, from Julian Anastasov. * Save some cycles in ip6t_REJECT by skipping checksum validation in packets leaving from our stack, from Stanislav Fomichev. * Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03netfilter: nf_tables: remove duplicated include from nf_tables_ipv4.cWei Yongjun
Remove duplicated include. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-02Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Conflicts: net/xfrm/xfrm_policy.c Minor merge conflict in xfrm_policy.c, consisting of overlapping changes which were trivial to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-02Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix a possible race on ipcomp scratch buffers because of too early enabled siftirqs. From Michal Kubecek. 2) The current xfrm garbage collector threshold is too small for some workloads, resulting in bad performance on these workloads. Increase the threshold from 1024 to 32768. 3) Some codepaths might not have a dst_entry attached to the skb when calling xfrm_decode_session(). So add a check to prevent a null pointer dereference in this case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-01xfrm: Fix null pointer dereference when decoding sessionsSteffen Klassert
On some codepaths the skb does not have a dst entry when xfrm_decode_session() is called. So check for a valid skb_dst() before dereferencing the device interface index. We use 0 as the device index if there is no valid skb_dst(), or at reverse decoding we use skb_iif as device interface index. Bug was introduced with git commit bafd4bd4dc ("xfrm: Decode sessions with output interface."). Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-10-30Merge branch 'dma_complete' into nextVinod Koul
2013-10-29tcp: temporarily disable Fast Open on SYN timeoutYuchung Cheng
Fast Open currently has a fall back feature to address SYN-data being dropped but it requires the middle-box to pass on regular SYN retry after SYN-data. This is implemented in commit aab487435 ("net-tcp: Fast Open client - detecting SYN-data drops") However some NAT boxes will drop all subsequent packets after first SYN-data and blackholes the entire connections. An example is in commit 356d7d8 "netfilter: nf_conntrack: fix tcp_in_window for Fast Open". The sender should note such incidents and fall back to use the regular TCP handshake on subsequent attempts temporarily as well: after the second SYN timeouts the original Fast Open SYN is most likely lost. When such an event recurs Fast Open is disabled based on the number of recurrences exponentially. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29net: esp{4,6}: get rid of struct esp_dataMathias Krause
struct esp_data consists of a single pointer, vanishing the need for it to be a structure. Fold the pointer into 'data' direcly, removing one level of pointer indirection. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-10-29net: esp{4,6}: remove padlen from struct esp_dataMathias Krause
The padlen member of struct esp_data is always zero. Get rid of it. Signed-off-by: Mathias Krause <mathias.krause@secunet.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-10-29ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORKHannes Frederic Sowa
UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and IP_PMTUDISC_PROBE well enough. UFO enabled packet delivery just appends all frags to the cork and hands it over to the network card. So we just deliver non-DF udp fragments (DF-flag may get overwritten by hardware or virtual UFO enabled interface). UDP_CORK does enqueue the data until the cork is disengaged. At this point it sets the correct IP_DF and local_df flags and hands it over to ip_fragment which in this case will generate an icmp error which gets appended to the error socket queue. This is not reflected in the syscall error (of course, if UFO is enabled this also won't happen). Improve this by checking the pmtudisc flags before appending data to the socket and if we still can fit all data in one packet when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, only then proceed. We use (mtu-fragheaderlen) to check for the maximum length because we ensure not to generate a fragment and non-fragmented data does not need to have its length aligned on 64 bit boundaries. Also the passed in ip_options are already aligned correctly. Maybe, we can relax some other checks around ip_fragment. This needs more research. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29tcp: gso: fix truesize trackingEric Dumazet
commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets") had an heuristic that can trigger a warning in skb_try_coalesce(), because skb->truesize of the gso segments were exactly set to mss. This breaks the requirement that skb->truesize >= skb->len + truesizeof(struct sk_buff); It can trivially be reproduced by : ifconfig lo mtu 1500 ethtool -K lo tso off netperf As the skbs are looped into the TCP networking stack, skb_try_coalesce() warns us of these skb under-estimating their truesize. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-28xfrm: Increase the garbage collector thresholdSteffen Klassert
With the removal of the routing cache, we lost the option to tweak the garbage collector threshold along with the maximum routing cache size. So git commit 703fb94ec ("xfrm: Fix the gc threshold value for ipv4") moved back to a static threshold. It turned out that the current threshold before we start garbage collecting is much to small for some workloads, so increase it from 1024 to 32768. This means that we start the garbage collector if we have more than 32768 dst entries in the system and refuse new allocations if we are above 65536. Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-10-28inet: restore gso for vxlanEric Dumazet
Alexei reported a performance regression on vxlan, caused by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable" GSO vxlan packets were not properly segmented, adding IP fragments while they were not expected. Rename 'bool tunnel' to 'bool encap', and add a new boolean to express the fact that UDP should be fragmented. This fragmentation is triggered by skb->encapsulation being set. Remove a "skb->encapsulation = 1" added in above commit, as its not needed, as frags inherit skb->frag from original GSO skb. Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-27tcp: do not rearm RTO when future data are sackedYuchung Cheng
Patch ed08495c3 "tcp: use RTT from SACK for RTO" always re-arms RTO upon obtaining a RTT sample from newly sacked data. But technically RTO should only be re-armed when the data sent before the last (re)transmission of write queue head are (s)acked. Otherwise the RTO may continue to extend during loss recovery on data sent in the future. Note that RTTs from ACK or timestamps do not have this problem, as the RTT source must be from data sent before. The new RTO re-arm policy is 1) Always re-arm RTO if SND.UNA is advanced 2) Re-arm RTO if sack RTT is available, provided the sacked data was sent before the last time write_queue_head was sent. Signed-off-by: Larry Brakmo <brakmo@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-27tcp: only take RTT from timestamps if new data is ackedYuchung Cheng
Patch ed08495c3 "tcp: use RTT from SACK for RTO" has a bug that it does not check if the ACK acknowledge new data before taking the RTT sample from TCP timestamps. This patch adds the check back as required by the RFC. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-27tcp: fix SYNACK RTT estimation in Fast OpenYuchung Cheng
tp->lsndtime may not always be the SYNACK timestamp if a passive Fast Open socket sends data before handshake completes. And if the remote acknowledges both the data and the SYNACK, the RTT sample is already taken in tcp_ack(), so no need to call tcp_update_ack_rtt() in tcp_synack_rtt_meas() aagain. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-25net: use DMA_COMPLETE for dma completion statusVinod Koul
Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>