summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)Author
2013-07-01Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/vxlan-next Stephen Hemminger says: ==================== Here is current updates for vxlan in net-next. It includes Mike's changes to handle multiple destinations and lots of little cosmetic stuff. This is a fresh vxlan-next repository which was forked from net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01x25: Fix broken locking in ioctl error paths.Dave Jones
Two of the x25 ioctl cases have error paths that break out of the function without unlocking the socket, leading to this warning: ================================================ [ BUG: lock held when returning to user space! ] 3.10.0-rc7+ #36 Not tainted ------------------------------------------------ trinity-child2/31407 is leaving the kernel with locks still held! 1 lock held by trinity-child2/31407: #0: (sk_lock-AF_X25){+.+.+.}, at: [<ffffffffa024b6da>] x25_ioctl+0x8a/0x740 [x25] Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01netem: use rb tree to implement the time queueEric Dumazet
Following typical setup to implement a ~100 ms RTT and big amount of reorders has very poor performance because netem implements the time queue using a linked list. ----------------------------------------------------------- ETH=eth0 IFB=ifb0 modprobe ifb ip link set dev $IFB up tc qdisc add dev $ETH ingress 2>/dev/null tc filter add dev $ETH parent ffff: \ protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress \ redirect dev $IFB ethtool -K $ETH gro off tso off gso off tc qdisc add dev $IFB root netem delay 50ms 10ms limit 100000 tc qd add dev $ETH root netem delay 50ms limit 100000 --------------------------------------------------------- Switch netem time queue to a rb tree, so this kind of setup can work at high speed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01sunrpc: Don't schedule an upcall on a replaced cache entry.NeilBrown
When a cache entry is replaced, the "expiry_time" get set to zero by a call to "cache_fresh_locked(..., 0)" at the end of "sunrpc_cache_update". This low expiry time makes cache_check() think that the 'refresh_age' is negative, so the 'age' is comparatively large and a refresh is triggered. However refreshing a replaced entry it pointless, it cannot achieve anything useful. So teach cache_check to ignore a low refresh_age when expiry_time is zero. Reported-by: Bodo Stroesser <bstroesser@ts.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01net/sunrpc: xpt_auth_cache should be ignored when expired.NeilBrown
commit d202cce8963d9268ff355a386e20243e8332b308 sunrpc: never return expired entries in sunrpc_cache_lookup moved the 'entry is expired' test from cache_check to sunrpc_cache_lookup, so that it happened early and some races could safely be ignored. However the ip_map (in svcauth_unix.c) has a separate single-item cache which allows quick lookup without locking. An entry in this case would not be subject to the expiry test and so could be used well after it has expired. This is not normally a big problem because the first time it is used after it is expired an up-call will be scheduled to refresh the entry (if it hasn't been scheduled already) and the old entry will then be invalidated. So on the second attempt to use it after it has expired, ip_map_cached_get will discard it. However that is subtle and not ideal, so replace the "!cache_valid" test with "cache_is_expired". In doing this we drop the test on the "CACHE_VALID" bit. This is unnecessary as the bit is never cleared, and an entry will only be cached if the bit is set. Reported-by: Bodo Stroesser <bstroesser@ts.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01sunrpc/cache: ensure items removed from cache do not have pending upcalls.NeilBrown
It is possible for a race to set CACHE_PENDING after cache_clean() has removed a cache entry from the cache. If CACHE_PENDING is still set when the entry is finally 'put', the cache_dequeue() will never happen and we can leak memory. So set a new flag 'CACHE_CLEANED' when we remove something from the cache, and don't queue any upcall if it is set. If CACHE_PENDING is set before CACHE_CLEANED, the call that cache_clean() makes to cache_fresh_unlocked() will free memory as needed. If CACHE_PENDING is set after CACHE_CLEANED, the test in sunrpc_cache_pipe_upcall will ensure that the memory is not allocated. Reported-by: <bstroesser@ts.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01sunrpc/cache: use cache_fresh_unlocked consistently and correctly.NeilBrown
cache_fresh_unlocked() is called when a cache entry has been updated and ensures that if there were any pending upcalls, they are cleared. So every time we update a cache entry, we should call this, and this should be the only way that we try to clear pending calls (that sort of uniformity makes code sooo much easier to read). try_to_negate_entry() will (possibly) mark an entry as negative. If it doesn't, it is because the entry already is VALID. So the entry will be valid on exit, so it is appropriate to call cache_fresh_unlocked(). So tidy up try_to_negate_entry() to do that, and remove partial open-coded cache_fresh_unlocked() from the one call-site of try_to_negate_entry(). In the other branch of the 'switch(cache_make_upcall())', we again have a partial open-coded version of cache_fresh_unlocked(). Replace that with a real call. And again in cache_clean(), use a real call to cache_fresh_unlocked(). These call sites might previously have called cache_revisit_request() if CACHE_PENDING wasn't set. This is never necessary because cache_revisit_request() can only do anything if the item is in the cache_defer_hash, However any time that an item is added to the cache_defer_hash (setup_deferral), the code immediately tests CACHE_PENDING, and removes the entry again if it is clear. So all other places we only need to 'cache_revisit_request' if we've just cleared CACHE_PENDING. Reported-by: Bodo Stroesser <bstroesser@ts.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01sunrpc/cache: remove races with queuing an upcall.NeilBrown
We currently queue an upcall after setting CACHE_PENDING, and dequeue after clearing CACHE_PENDING. So a request should only be present when CACHE_PENDING is set. However we don't combine the test and the enqueue/dequeue in a protected region, so it is possible (if unlikely) for a race to result in a request being queued without CACHE_PENDING set, or a request to be absent despite CACHE_PENDING. So: include a test for CACHE_PENDING inside the regions of enqueue and dequeue where queue_lock is held, and abort the operation if the value is not as expected. Also remove the early 'return' from cache_dequeue() to ensure that it always removes all entries: As there is no locking between setting CACHE_PENDING and calling sunrpc_cache_pipe_upcall it is not inconceivable for some other thread to clear CACHE_PENDING and then someone else to set it and call sunrpc_cache_pipe_upcall, both before the original threads completed the call. With this, it perfectly safe and correct to: - call cache_dequeue() if and only if we have just cleared CACHE_PENDING - call sunrpc_cache_pipe_upcall() (via cache_make_upcall) if and only if we have just set CACHE_PENDING. Reported-by: Bodo Stroesser <bstroesser@ts.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Bodo Stroesser <bstroesser@ts.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01svcrpc: don't error out on small tcp fragmentJ. Bruce Fields
Though clients we care about mostly don't do this, it is possible for rpc requests to be sent in multiple fragments. Here we have a sanity check to ensure that the final received rpc isn't too small--except that the number we're actually checking is the length of just the final fragment, not of the whole rpc. So a perfectly legal rpc that's unluckily fragmented could cause the server to close the connection here. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01svcrpc: fix handling of too-short rpc'sJ. Bruce Fields
If we detect that an rpc is too short, we abort and close the connection. Except, there's a bug here: we're leaving sk_datalen nonzero without leaving any pages in the sk_pages array. The most likely result of the inconsistency is a subsequent crash in svc_tcp_clear_pages. Also demote the BUG_ON in svc_tcp_clear_pages to a WARN. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01svcrpc: store gss mech in svc_credJ. Bruce Fields
Store a pointer to the gss mechanism used in the rq_cred and cl_cred. This will make it easier to enforce SP4_MACH_CRED, which needs to compare the mechanism used on the exchange_id with that used on protected operations. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01svcrpc: introduce init_svc_credJ. Bruce Fields
Common helper to zero out fields of the svc_cred. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-07-01Merge branch 'for-3.10' into 'for-3.11'J. Bruce Fields
Merge bugfixes into my for-3.11 branch.
2013-07-01neighbour: fix a race in neigh_destroy()Eric Dumazet
There is a race in neighbour code, because neigh_destroy() uses skb_queue_purge(&neigh->arp_queue) without holding neighbour lock, while other parts of the code assume neighbour rwlock is what protects arp_queue Convert all skb_queue_purge() calls to the __skb_queue_purge() variant Use __skb_queue_head_init() instead of skb_queue_head_init() to make clear we do not use arp_queue.lock And hold neigh->lock in neigh_destroy() to close the race. Reported-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01ipv6: fix ecmp lookup when oif is specifiedNicolas Dichtel
There is no reason to skip ECMP lookup when oif is specified, but this implies to check oif given by user when selecting another route. When the new route does not match oif requirement, we simply keep the initial one. Spotted-by: dingzhi <zhi.ding@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01ipv6: only apply anti-spoofing checks to not-pointopoint tunnelsHannes Frederic Sowa
Because of commit 218774dc341f219bfcf940304a081b121a0e8099 ("ipv6: add anti-spoofing checks for 6to4 and 6rd") the sit driver dropped packets for 2002::/16 destinations and sources even when configured to work as a tunnel with fixed endpoint. We may only apply the 6rd/6to4 anti-spoofing checks if the device is not in pointopoint mode. This was an oversight from me in the above commit, sorry. Thanks to Roman Mamedov for reporting this! Reported-by: Roman Mamedov <rm@romanrm.ru> Cc: David Miller <davem@davemloft.net> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== Yet one more pull request for wireless updates intended for 3.11... For the mac80211 bits, Johannes says: "Here we have a few memory leak fixes related to BSS struct handling mostly from Ben, including a fix for a more theoretical problem (associating while a BSS struct times out) from myself, a compilation warning fix from Arend, mesh fixes from Thomas, tracking the beacon bitrate (Alex), a bandwidth change event fix (Ilan) and some initial work for 5/10 MHz channels from Simon." Regarding the iwlwifi bits, Johannes says: "Emmanuel removed some unneeded/unsupported module parameters and adds a Bluetooth 1x1 lookup-table for some upcoming products. From Alex I have an older patch to add low-power receive support, this depended on a mac80211 commit that only just came in with the merge from wireless-next I did. Ilan made beacon timings better, and Eytan added some debug statements for thermal throttling. I have a few cleanups, a fix for a long-standing but rare warning, and, arguably the most important patch here, the firmware API version bump for the 7260/3160 devices." Also included is a Bluetooth pull -- Gustavo says: "Here goes a set of patches to 3.11. The biggest work here is from Andre Guedes on the move of the Discovery to use the new request framework. Other than that Johan provided a bunch of fixes to the L2CAP code. The rest are just small fixes and clean ups." On top of all that, there are a variety of updates and fixes to brcmfmac, rt2x00, wil6210, ath9k, ath10k, and a few others here and there. This also includes a pull of the wireless tree, in order to prevent some merge conflicts. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01openvswitch: Add Kconfig dependency on GRE-DEMUX.Pravin B Shelar
Openvswitch uses function from NET_IPGRE_DEMUX module. Add Kconfig dependency to fix following compilation errors: http://marc.info/?l=linux-netdev&m=137244035226634 CC: Jesse Gross <jesse@nicira.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Pravin Shelar <pshelar@nicira.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-30Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following batch contains Netfilter/IPVS updates for net-next, they are: * Enforce policy to several nfnetlink subsystem, from Daniel Borkmann. * Use xt_socket to match the third packet (to perform simplistic socket-based stateful filtering), from Eric Dumazet. * Avoid large timeout for picked up from the middle TCP flows, from Florian Westphal. * Exclude IPVS from struct net if IPVS is disabled and removal of unnecessary included header file, from JunweiZhang. * Release SCTP connection immediately under load, to mimic current TCP behaviour, from Julian Anastasov. * Replace and enhance SCTP state machine, from Julian Anastasov. * Add tweak to reduce sync traffic in the presence of persistence, also from Julian Anastasov. * Add tweak for the IPVS SH scheduler not to reject connections directed to a server, choose a new one instead, from Alexander Frolkin. * Add support for sloppy TCP and SCTP modes, that creates state information on any packet, not only initial handshake packets, from Alexander Frolkin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-30netfilter: nf_queue: add NFQA_SKB_CSUM_NOTVERIFIED info flagFlorian Westphal
The common case is that TCP/IP checksums have already been verified, e.g. by hardware (rx checksum offload), or conntrack. Userspace can use this flag to determine when the checksum has not been validated yet. If the flag is set, this doesn't necessarily mean that the packet has an invalid checksum, e.g. if NIC doesn't support rx checksum. Userspace that sucessfully enabled NFQA_CFG_F_GSO queue feature flag can infer that IP/TCP checksum has already been validated if either the SKB_INFO attribute is not present or the NFQA_SKB_CSUM_NOTVERIFIED flag is unset. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-06-29more open-coded file_inode() callsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-28ipv4: use next hop exceptions also for input routesTimo Teräs
Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops) assmued that "locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us". However, it seems that tunnel devices do trigger PMTU events in certain cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's skb_dst(skb)->ops->update_pmtu to propage mtu information from the outer flows. These can cause the inner flow mtu to be decreased. If next hop exceptions are not consulted for pmtu, IP fragmentation will not be done properly for these routes. It also seems that we really need to have the PMTU information always for netfilter TCPMSS clamp-to-pmtu feature to work properly. So for the time being, cache separate copies of input routes for each next hop exception. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-28ipv6: resend MLD report if a link-local address completes DADHannes Frederic Sowa
RFC3590/RFC3810 specifies we should resend MLD reports as soon as a valid link-local address is available. We now use the valid_ll_addr_cnt to check if it is necessary to resend a new report. Changes since Flavio Leitner's version: a) adapt for valid_ll_addr_cnt b) resend first reports directly in the path and just arm the timer for mc_qrv-1 resends. Reported-by: Flavio Leitner <fleitner@redhat.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Stevens <dlstevens@us.ibm.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-28ipv6: introduce per-interface counter for dad-completed ipv6 addressesHannes Frederic Sowa
To reduce the number of unnecessary router solicitations, MLDv2 and IGMPv3 messages we need to track the number of valid (as in non-optimistic, no-dad-failed and non-tentative) link-local addresses. Therefore, this patch implements a valid_ll_addr_cnt in struct inet6_dev. We now only emit router solicitations if the first link-local address finishes duplicate address detection. The changes for MLDv2 and IGMPv3 are in a follow-up patch. While there, also simplify one if statement(one minor nit I made in one of my previous patches): if (!...) do(); else return; <<into>> if (...) return; do(); Cc: Flavio Leitner <fbl@redhat.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: David Stevens <dlstevens@us.ibm.com> Suggested-by: David Stevens <dlstevens@us.ibm.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-28SUNRPC: PipeFS MOUNT notification optimization for dying clientsStanislav Kinsbursky
Not need to create pipes for dying client. So just skip them. Note: we can safely dereference the client structure, because notification caller is holding sn->pipefs_sb_lock. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28SUNRPC: split client creation routine into setup and registrationStanislav Kinsbursky
This helper moves all "registration" code to the new rpc_client_register() helper. This helper will be used later in the series to synchronize against PipeFS MOUNT/UMOUNT events. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28SUNRPC: fix races on PipeFS UMOUNT notificationsStanislav Kinsbursky
CPU#0 CPU#1 ----------------------------- ----------------------------- rpc_kill_sb sn->pipefs_sb = NULL rpc_release_client (UMOUNT_EVENT) rpc_free_auth rpc_pipefs_event rpc_get_client_for_event !atomic_inc_not_zero(cl_count) <skip the client> atomic_inc(cl_count) rpc_free_client rpc_clnt_remove_pipedir <skip client dir removing> To fix this, this patch does the following: 1) Calls RPC_PIPEFS_UMOUNT notification with sn->pipefs_sb_lock being held. 2) Removes SUNRPC client from the list AFTER pipes destroying. 3) Doesn't hold RPC client on notification: if client in the list, then it can't be destroyed while sn->pipefs_sb_lock in hold by notification caller. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28SUNRPC: fix races on PipeFS MOUNT notificationsStanislav Kinsbursky
Below are races, when RPC client can be created without PiepFS dentries CPU#0 CPU#1 ----------------------------- ----------------------------- rpc_new_client rpc_fill_super rpc_setup_pipedir mutex_lock(&sn->pipefs_sb_lock) rpc_get_sb_net == NULL (no per-net PipeFS superblock) sn->pipefs_sb = sb; notifier_call_chain(MOUNT) (client is not in the list) rpc_register_client (client without pipes dentries) To fix this patch: 1) makes PipeFS mount notification call with pipefs_sb_lock being held. 2) releases pipefs_sb_lock on new SUNRPC client creation only after registration. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-28Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: net/wireless/nl80211.c
2013-06-28Merge branch 'freezer'Rafael J. Wysocki
* freezer: af_unix: use freezable blocking calls in read sigtimedwait: use freezable blocking call nanosleep: use freezable blocking call futex: use freezable blocking call select: use freezable blocking call epoll: use freezable blocking call binder: use freezable blocking calls freezer: add new freezable helpers using freezer_do_not_count() freezer: convert freezable helpers to static inline where possible freezer: convert freezable helpers to freezer_do_not_count() freezer: skip waking up tasks with PF_FREEZER_SKIP set freezer: shorten freezer sleep time using exponential backoff lockdep: check that no locks held at freeze time lockdep: remove task argument from debug_check_no_locks_held freezer: add unsafe versions of freezable helpers for CIFS freezer: add unsafe versions of freezable helpers for NFS
2013-06-27netlink: fix splat in skb_clone with large messagesPablo Neira
Since (c05cdb1 netlink: allow large data transfers from user-space), netlink splats if it invokes skb_clone on large netlink skbs since: * skb_shared_info was not correctly initialized. * skb->destructor is not set in the cloned skb. This was spotted by trinity: [ 894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001 [ 894.991034] IP: [<ffffffff81a212c4>] skb_clone+0x24/0xc0 [...] [ 894.991034] Call Trace: [ 894.991034] [<ffffffff81ad299a>] nl_fib_input+0x6a/0x240 [ 894.991034] [<ffffffff81c3b7e6>] ? _raw_read_unlock+0x26/0x40 [ 894.991034] [<ffffffff81a5f189>] netlink_unicast+0x169/0x1e0 [ 894.991034] [<ffffffff81a601e1>] netlink_sendmsg+0x251/0x3d0 Fix it by: 1) introducing a new netlink_skb_clone function that is used in nl_fib_input, that sets our special skb->destructor in the cloned skb. Moreover, handle the release of the large cloned skb head area in the destructor path. 2) not allowing large skbuffs in the netlink broadcast path. I cannot find any reasonable use of the large data transfer using netlink in that path, moreover this helps to skip extra skb_clone handling. I found two more netlink clients that are cloning the skbs, but they are not in the sendmsg path. Therefore, the sole client cloning that I found seems to be the fib frontend. Thanks to Eric Dumazet for helping to address this issue. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27sit: add support of x-netnsNicolas Dichtel
This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27dev: introduce skb_scrub_packet()Nicolas Dichtel
The goal of this new function is to perform all needed cleanup before sending an skb into another netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26ipv6: rearm router solicitaion timer when setting new tokenized addressHannes Frederic Sowa
When a new tokenized address gets installed we send out just one router solicition. We should send out `rtr_solicits' in case one router advertisment got lost. So, rearm the timer as we do in addrconf_dad_complete. Cc: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26af_key: fix info leaks in notify messagesMathias Krause
key_notify_sa_flush() and key_notify_policy_flush() miss to initialize the sadb_msg_reserved member of the broadcasted message and thereby leak 2 bytes of heap memory to listeners. Fix that. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26ipv6: ip6_sk_dst_check() must not assume ipv6 dstEric Dumazet
It's possible to use AF_INET6 sockets and to connect to an IPv4 destination. After this, socket dst cache is a pointer to a rtable, not rt6_info. ip6_sk_dst_check() should check the socket dst cache is IPv6, or else various corruptions/crashes can happen. Dave Jones can reproduce immediate crash with trinity -q -l off -n -c sendmsg -c connect With help from Hannes Frederic Sowa Reported-by: Dave Jones <davej@redhat.com> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26net: fix kernel deadlock with interface rename and netdev name retrieval.Nicolas Schichan
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the rename of a network interface, it can end up waiting for a workqueue to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to the fact that read_secklock_begin() will spin forever waiting for the writer process (the one doing the interface rename) to update the devnet_rename_seq sequence. This patch fixes the problem by adding a helper (netdev_get_name()) and using it in the code handling the SIOCGIFNAME ioctl and SO_BINDTODEVICE setsockopt. The netdev_get_name() helper uses raw_seqcount_begin() to avoid spinning forever, waiting for devnet_rename_seq->sequence to become even. cond_resched() is used in the contended case, before retrying the access to give the writer process a chance to finish. The use of raw_seqcount_begin() will incur some unneeded work in the reader process in the contended case, but this is better than deadlocking the system. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26sit: fix 4in4 + IPsec scenarioNicolas Dichtel
Since commit 32b8a8e59c9c "sit: add IPv4 over IPv4 support", tunnel->parms.iph.protocol is 0 when both 4in4 and 6in4 are setup, but xfrm_lookup() is called only when proto is != 0, thus we need to pass the real value. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== Just one patch this time. 1) Drop packets when the matching SA is in larval state and add a statistic counter for that. From Fan Du. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
2013-06-26ipvs: add sync_persist_mode flagJulian Anastasov
Add sync_persist_mode flag to reduce sync traffic by syncing only persistent templates. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: SH fallback and L4 hashingAlexander Frolkin
By default the SH scheduler rejects connections that are hashed onto a realserver of weight 0. This patch adds a flag to make SH choose a different realserver in this case, instead of rejecting the connection. The patch also adds a flag to make SH include the source port (TCP, UDP, SCTP) in the hash as well as the source address. This basically allows for deterministic round-robin load balancing (i.e., where any director in a cluster of directors with identical config will send the same packet the same way). The flags are service flags (IP_VS_SVC_F_SCHED*) so that these options can be set per service. They are set using a new option to ipvsadm. Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: drop SCTP connections depending on stateJulian Anastasov
Drop SCTP connections under load (dropentry context) depending on the protocol state, just like for TCP: INIT conns are dropped immediately, established are dropped randomly while connections in progress or shutdown are skipped. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: replace the SCTP state machineJulian Anastasov
Convert the SCTP state table, so that it is more readable. Change the states to be according to the diagram in RFC 2960 and add more states suitable for middle box. Still, such change in states adds incompatibility if systems in sync setup include this change and others do not include it. With this change we also have proper transitions in INPUT-ONLY mode (DR/TUN) where we see packets only from client. Now we should not switch to 10-second CLOSED state at a time when we should stay in ESTABLISHED state. The short names for states are because we have 16-char space in ipvsadm and 11-char limit for the connection list format. It is a sequence of the TCP implementation where the longest state name is ESTABLISHED. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: sloppy TCP and SCTPAlexander Frolkin
This adds support for sloppy TCP and SCTP modes to IPVS. When enabled (sysctls net.ipv4.vs.sloppy_tcp and net.ipv4.vs.sloppy_sctp), allows IPVS to create connection state on any packet, not just a TCP SYN (or SCTP INIT). This allows connections to fail over from one IPVS director to another mid-flight. Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26ipvs: provide iph to schedulersJulian Anastasov
Before now the schedulers needed access only to IP addresses and it was easy to get them from skb by using ip_vs_fill_iph_addr_only. New changes for the SH scheduler will need the protocol and ports which is difficult to get from skb for the IPv6 case. As we have all the data in the iph structure, to avoid the same slow lookups provide the iph to schedulers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-25Merge ../vxlan-xStephen Hemminger
2013-06-25bridge: check for zero ether address in fdb addStephen Hemminger
The check for all-zero ether address was removed from rtnetlink core, since Vxlan uses all-zero ether address to signify default address. Need to add check back in for bridge. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
2013-06-25net: poll/select low latency socket supportEliezer Tamir
select/poll busy-poll support. Split sysctl value into two separate ones, one for read and one for poll. updated Documentation/sysctl/net.txt Add a new poll flag POLL_LL. When this flag is set, sock_poll will call sk_poll_ll if possible. sock_poll sets this flag in its return value to indicate to select/poll when a socket that can busy poll is found. When poll/select have nothing to report, call the low-level sock_poll again until we are out of time or we find something. Once the system call finds something, it stops setting POLL_LL, so it can return the result to the user ASAP. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25net: sctp: simplify sctp_get_portDaniel Borkmann
No need to have an extra ret variable when we directly can return the value of sctp_get_port_local(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>