path: root/net
Age  Commit message  Author
2014-08-04  bridge: remove a useless comment  (Michael S. Tsirkin)
commit 6cbdceeb1cb12c7d620161925a8c3e81daadb2e4 bridge: Dump vlan information from a bridge port introduced a comment in an attempt to explain the code logic. The comment is unfinished so it confuses more than it explains, remove it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04  Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
Pull cgroup changes from Tejun Heo:
 "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers.

  - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented.

  - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface.

  - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down.

  All the changes are to the new interface and no behavior changed for the multiple hierarchies"

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
  cpuset: fix the WARN_ON() in update_nodemasks_hier()
  cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
  cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
  cgroup: distinguish the default and legacy hierarchies when handling cftypes
  cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
  cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
  cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
  cpuset: export effective masks to userspace
  cpuset: allow writing offlined masks to cpuset.cpus/mems
  cpuset: enable onlined cpu/node in effective masks
  cpuset: refactor cpuset_hotplug_update_tasks()
  cpuset: make cs->{cpus, mems}_allowed as user-configured masks
  cpuset: apply cs->effective_{cpus,mems}
  cpuset: initialize top_cpuset's configured masks at mount
  cpuset: use effective cpumask to build sched domains
  cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
  cpuset: update cs->effective_{cpus, mems} when config changes
  cpuset: update cpuset->effective_{cpus,mems} at hotplug
  cpuset: add cs->effective_cpus and cs->effective_mems
  cgroup: clean up sane_behavior handling
  ...
2014-08-04  batman-adv: prefer kmalloc_array to kmalloc when possible  (Antonio Quartulli)
Reported by checkpatch with the following warning: WARNING: Prefer kmalloc_array over kmalloc with multiply Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
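For illustration only (not the actual batman-adv hunks), the checkpatch-preferred form replaces an open-coded multiply with kmalloc_array(), which also refuses allocations whose size computation would overflow; the helper below is hypothetical:

  #include <linux/slab.h>
  #include <linux/types.h>

  /* Hypothetical helper showing the checkpatch-preferred form. */
  static u32 *alloc_ids(size_t n)
  {
          /*
           * Before: kmalloc(n * sizeof(u32), GFP_ATOMIC) leaves the multiply
           * open to overflow.  kmalloc_array() performs the same allocation
           * but returns NULL if n * sizeof(u32) would overflow.
           */
          return kmalloc_array(n, sizeof(u32), GFP_ATOMIC);
  }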
2014-08-04  SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred  (NeilBrown)
current_cred() can only be changed by 'current', and cred->group_info is never changed. If a new group_info is needed, a new 'cred' is created. Consequently it is always safe to access current_cred()->group_info without taking any further references. So drop the refcounting and the incorrect rcu_dereference(). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03  sunrpc/auth: allow lockless (rcu) lookup of credential cache.  (NeilBrown)
The new flag RPCAUTH_LOOKUP_RCU to credential lookup avoids locking, does not take a reference on the returned credential, and returns -ECHILD if a simple lookup was not possible. The returned value can only be used within an rcu_read_lock protected region. The main user of this is the new rpc_lookup_cred_nonblock() which returns a pointer to the current credential which is only rcu-safe (no ref-count held), and might return -ECHILD if allocation was required. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
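A minimal sketch of how a caller might use this lockless path; the helper and its error handling are hypothetical, but the constraints (result only valid inside rcu_read_lock(), possible -ECHILD) follow the commit text:

  #include <linux/err.h>
  #include <linux/rcupdate.h>
  #include <linux/sunrpc/auth.h>

  /* Hypothetical caller: peek at the current task's RPC credential
   * without taking a reference. */
  static int peek_current_rpc_cred(void)
  {
          struct rpc_cred *cred;
          int err = 0;

          rcu_read_lock();
          cred = rpc_lookup_cred_nonblock();
          if (IS_ERR(cred))
                  err = PTR_ERR(cred);    /* -ECHILD: fall back to a blocking lookup */
          /* ... inspect cred here; it must not be used after rcu_read_unlock() ... */
          rcu_read_unlock();
          return err;
  }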
2014-08-03  sunrpc: remove "ec" argument from encrypt_v2 operation  (Jeff Layton)
It's always 0. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03  sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c  (Jeff Layton)
Fix the endianness handling in gss_wrap_kerberos_v1 and drop the memset call there in favor of setting the filler bytes directly. In gss_wrap_kerberos_v2, get rid of the "ec" variable which is always zero, and drop the endianness conversion of 0. Sparse handles 0 as a special case, so it's not necessary. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03  sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c  (Jeff Layton)
Use u16 pointer in setup_token and setup_token_v2. None of the fields are actually handled as __be16, so this simplifies the code a bit. Also get rid of some unneeded pointer increments. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03  sunrpc: fix RCU handling of gc_ctx field  (Jeff Layton)
The handling of the gc_ctx pointer only seems to be partially RCU-safe. The assignment and freeing are done using RCU, but many places in the code seem to dereference that pointer without proper RCU safeguards. Fix them to use rcu_dereference and to rcu_read_lock/unlock, and to properly handle the case where the pointer is NULL. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
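A hedged illustration of the reader pattern the fix applies (rcu_read_lock()/rcu_dereference() plus an explicit NULL check); the struct and field names come from the commit, the helper itself is hypothetical:

  #include <linux/rcupdate.h>
  #include <linux/sunrpc/auth_gss.h>

  /* Hypothetical reader: check whether a GSS credential currently has a context. */
  static bool gss_cred_has_ctx(struct gss_cred *gss_cred)
  {
          struct gss_cl_ctx *ctx;
          bool ret;

          rcu_read_lock();
          ctx = rcu_dereference(gss_cred->gc_ctx);
          ret = (ctx != NULL);            /* handle the NULL pointer explicitly */
          rcu_read_unlock();
          return ret;
  }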
2014-08-03  Merge branch 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next  (Trond Myklebust)
* 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (916 commits)
  xprtrdma: Handle additional connection events
  xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
  xprtrdma: Make rpcrdma_ep_disconnect() return void
  xprtrdma: Schedule reply tasklet once per upcall
  xprtrdma: Allocate each struct rpcrdma_mw separately
  xprtrdma: Rename frmr_wr
  xprtrdma: Disable completions for LOCAL_INV Work Requests
  xprtrdma: Disable completions for FAST_REG_MR Work Requests
  xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()
  xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request
  xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect
  xprtrdma: Properly handle exhaustion of the rb_mws list
  xprtrdma: Chain together all MWs in same buffer pool
  xprtrdma: Back off rkey when FAST_REG_MR fails
  xprtrdma: Unclutter struct rpcrdma_mr_seg
  xprtrdma: Don't invalidate FRMRs if registration fails
  xprtrdma: On disconnect, don't ignore pending CQEs
  xprtrdma: Update rkeys after transport reconnect
  xprtrdma: Limit data payload size for ALLPHYSICAL
  xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs
  ...
2014-08-03  SUNRPC: Enforce an upper limit on the number of cached credentials  (Trond Myklebust)
In some cases where the credentials are not often reused, we may want to limit their total number just in order to make the negative lookups in the hash table more manageable. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-02  nftables: Convert nft_hash to use generic rhashtable  (Thomas Graf)
The sizing of the hash table and the practice of requiring a lookup to retrieve the pprev to be stored in the element cookie before the deletion of an entry are left intact. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Patrick McHardy <kaber@trash.net> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  netlink: Convert netlink_lookup() to use RCU protected hash table  (Thomas Graf)
Heavy Netlink users such as Open vSwitch spend a considerable amount of time in netlink_lookup() due to the read-lock on nl_table_lock. Use of RCU relieves the lock contention.

Makes use of the new resizable hash table to avoid locking on the lookup. The hash table will grow if entries exceed 75% of table size up to a total table size of 64K. It will automatically shrink if usage falls below 30%.

Also splits nl_table_lock into a separate mutex to protect hash table mutations and allow synchronize_rcu() to sleep while waiting for readers during expansion and shrinking.

Before:
   9.16%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   6.42%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   6.26%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   6.23%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.79%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
   4.37%  kpktgend_0  [kernel.kallsyms]  [k] memcpy
   3.60%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   2.69%  kpktgend_0  [kernel.kallsyms]  [k] jhash2

After:
  15.26%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   8.12%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   7.92%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   5.11%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.11%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   4.06%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   3.90%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
   [...]
   0.67%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  (David S. Miller)
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree, they are:

1) Maintain all DSCP and ECN bits for IPv6 tun forwarding. This resolves an inconsistency between IPv4 and IPv6 behaviour. Patch from Alex Gartrell via Simon Horman.

2) Fix unnoticeable blink in xt_LED when the led-always-blink option is used, from Jiri Prchal.

3) Add missing return in nft_del_setelem(), otherwise this results in a double call of nft_data_uninit() in the nf_tables code, from Thomas Graf.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  ipv6: data of fwmark_reflect sysctl needs to be updated on netns construction  (Hannes Frederic Sowa)
Fixes: e110861f86094cd ("net: add a sysctl to reflect the fwmark on replies") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  inet: frags: use kmem_cache for inet_frag_queue  (Nikolay Aleksandrov)
Use kmem_cache to allocate/free inet_frag_queue objects since they're all the same size per inet_frags user and are alloced/freed in high volumes thus making it a perfect case for kmem_cache. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
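A hedged sketch of the kmem_cache pattern being adopted; the cache name and wrapper functions are illustrative, not the exact inet_fragment.c changes:

  #include <linux/errno.h>
  #include <linux/slab.h>
  #include <net/inet_frag.h>

  static struct kmem_cache *frag_cachep;

  /* One cache per inet_frags user; every queue object has the same size. */
  static int frag_cache_init(size_t qsize)
  {
          frag_cachep = kmem_cache_create("hypothetical_frag_queue", qsize, 0,
                                          SLAB_HWCACHE_ALIGN, NULL);
          return frag_cachep ? 0 : -ENOMEM;
  }

  static struct inet_frag_queue *frag_alloc(void)
  {
          return kmem_cache_zalloc(frag_cachep, GFP_ATOMIC);
  }

  static void frag_free(struct inet_frag_queue *q)
  {
          kmem_cache_free(frag_cachep, q);
  }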
2014-08-02  inet: frags: use INET_FRAG_EVICTED to prevent icmp messages  (Nikolay Aleksandrov)
Now that we have INET_FRAG_EVICTED we might as well use it to stop sending icmp messages in the "frag_expire" functions instead of stripping INET_FRAG_FIRST_IN from their flags when evicting. Also fix the comment style in ip6_expire_frag_queue(). Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  inet: frags: fix function declaration alignments in inet_fragment  (Nikolay Aleksandrov)
Fix a couple of functions' declaration alignments. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  inet: frags: rename last_in to flags  (Nikolay Aleksandrov)
The last_in field has been used to store various flags different from first/last frag in so give it a more descriptive name: flags. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  inet: frags: use INC_STATS_BH in the ipv6 reassembly code  (Nikolay Aleksandrov)
Softirqs are already disabled so no need to do it again, thus let's be consistent and use the IP6_INC_STATS_BH variant. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  ipv4: remove nested rcu_read_lock/unlock  (Duan Jiong)
ip_local_deliver_finish() already has an rcu_read_lock/unlock, so the nested rcu_read_lock/unlock is unnecessary. See the call chain below:

  ip_local_deliver_finish
    -> icmp_rcv
      -> icmp_socket_deliver

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  net: filter: split 'struct sk_filter' into socket and bpf parts  (Alexei Starovoitov)
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into

  struct sk_filter {
          atomic_t        refcnt;
          struct rcu_head rcu;
          struct bpf_prog *prog;
  };

and

  struct bpf_prog {
          u32                     jited:1,
                                  len:31;
          struct sock_fprog_kern  *orig_prog;
          unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                              const struct bpf_insn *filter);
          union {
                  struct sock_filter      insns[0];
                  struct bpf_insn         insnsi[0];
                  struct work_struct      work;
          };
  };

so that 'struct bpf_prog' can be used independent of sockets and cleans up 'unattached' bpf use cases

split SK_RUN_FILTER macro into:
  SK_RUN_FILTER to be used with 'struct sk_filter *'
and
  BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains __bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work with 'struct bpf_prog *', since they're on the same lines:

  sk_filter_size               -> bpf_prog_size
  sk_filter_select_runtime     -> bpf_prog_select_runtime
  sk_filter_free               -> bpf_prog_free
  sk_unattached_filter_create  -> bpf_prog_create
  sk_unattached_filter_destroy -> bpf_prog_destroy
  sk_store_orig_filter         -> bpf_prog_store_orig_filter
  sk_release_orig_filter       -> bpf_release_orig_filter
  __sk_migrate_filter          -> bpf_migrate_filter
  __sk_prepare_filter          -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
  sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and
  SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
  bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and
  BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
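A hedged sketch of the 'unattached' API named above (bpf_prog_create / BPF_PROG_RUN / bpf_prog_destroy); the accept-all program and the calling function are hypothetical:

  #include <linux/filter.h>
  #include <linux/kernel.h>
  #include <linux/skbuff.h>

  /* Hypothetical classic-BPF program: accept every packet. */
  static struct sock_filter accept_all[] = {
          BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
  };

  static int run_unattached_filter(struct sk_buff *skb)
  {
          struct sock_fprog_kern fprog = {
                  .len    = ARRAY_SIZE(accept_all),
                  .filter = accept_all,
          };
          struct bpf_prog *prog;
          unsigned int res;
          int err;

          err = bpf_prog_create(&prog, &fprog);   /* was sk_unattached_filter_create() */
          if (err)
                  return err;

          res = BPF_PROG_RUN(prog, skb);          /* run the program against the skb */
          bpf_prog_destroy(prog);                 /* was sk_unattached_filter_destroy() */

          return res ? 0 : -EPERM;                /* a 0 return from BPF means "drop" */
  }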
2014-08-02  net: filter: rename sk_convert_filter() -> bpf_convert_filter()  (Alexei Starovoitov)
to indicate that this function is converting classic BPF into eBPF and not related to sockets Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  net: filter: rename sk_chk_filter() -> bpf_check_classic()  (Alexei Starovoitov)
trivial rename to indicate that this functions performs classic BPF checking Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  net: filter: rename sk_filter_proglen -> bpf_classic_proglen  (Alexei Starovoitov)
trivial rename to better match semantics of macro Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  net: filter: simplify socket charging  (Alexei Starovoitov)
attaching bpf program to a socket involves multiple socket memory arithmetic, since size of 'sk_filter' is changing when classic BPF is converted to eBPF. Also common path of program creation has to deal with two ways of freeing the memory. Simplify the code by delaying socket charging until program is ready and its size is known Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02  Merge branch 'next' of git://git.infradead.org/users/pcmoore/selinux into next  (James Morris)
2014-08-01  netfilter: nf_tables: Avoid duplicate call to nft_data_uninit() for same key  (Thomas Graf)
nft_del_setelem() currently calls nft_data_uninit() twice on the same key. Once to release the key which is guaranteed to be NFT_DATA_VALUE and a second time in the error path to which it falls through. The second call has been harmless so far though because the type passed is always NFT_DATA_VALUE which is currently a no-op. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-08-01  netlabel: shorter names for the NetLabel catmap funcs/structs  (Paul Moore)
Historically the NetLabel LSM secattr catmap functions and data structures have had very long names which makes a mess of the NetLabel code and anyone who uses NetLabel. This patch renames the catmap functions and structures from "*_secattr_catmap_*" to just "*_catmap_*" which improves things greatly. There are no substantial code or logic changes in this patch. Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2014-08-01  netlabel: fix the catmap walking functions  (Paul Moore)
The two NetLabel LSM secattr catmap walk functions didn't handle certain edge conditions correctly, causing incorrect security labels to be generated in some cases. This patch corrects these problems and converts the functions to use the new _netlbl_secattr_catmap_getnode() function in order to reduce the amount of repeated code. Cc: stable@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2014-08-01  netlabel: fix the horribly broken catmap functions  (Paul Moore)
The NetLabel secattr catmap functions, and the SELinux import/export glue routines, were broken in many horrible ways and the SELinux glue code fiddled with the NetLabel catmap structures in ways that we probably shouldn't allow. At some point this "worked", but that was likely due to a bit of dumb luck and sub-par testing (both inflicted by yours truly). This patch corrects these problems by basically gutting the code in favor of something less obtuse and restoring the NetLabel abstractions in the SELinux catmap glue code. Everything is working now, and if it decides to break itself in the future this code will be much easier to debug than the code it replaces. One noteworthy side effect of the changes is that it is no longer necessary to allocate a NetLabel catmap before calling one of the NetLabel APIs to set a bit in the catmap. NetLabel will automatically allocate the catmap nodes when needed, resulting in less allocations when the lowest bit is greater than 255 and less code in the LSMs. Cc: stable@vger.kernel.org Reported-by: Christian Evans <frodox@zoho.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2014-08-01  netlabel: fix a problem when setting bits below the previously lowest bit  (Paul Moore)
The NetLabel category (catmap) functions have a problem in that they assume categories will be set in an increasing manner, e.g. the next category set will always be larger than the last. Unfortunately, this is not a valid assumption and could result in problems when attempting to set categories less than the startbit in the lowest catmap node. In some cases kernel panics and other nasties can result. This patch corrects the problem by checking for this and allocating a new catmap node instance and placing it at the front of the list. Cc: stable@vger.kernel.org Reported-by: Christian Evans <frodox@zoho.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2014-07-31  net: use inet6_iif instead of IP6CB()->iif  (Duan Jiong)
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31  net: Correctly set segment mac_len in skb_segment().  (Vlad Yasevich)
When performing segmentation, the mac_len value is copied right out of the original skb. However, this value is not always set correctly (like when the packet is VLAN-tagged) and we'll end up copying a bad value.

One way to demonstrate this is to configure a VM which tags packets internally and turn off VLAN acceleration on the forwarding bridge port. The packets show up corrupt like this:

  16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
        0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402  ...|...d@.....d.
        0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa  ..d....A.^d..0..
        0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833  )....q.........3
        0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470  ...unetperf.netp
        0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470  erf.netperf.netp
        0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470  erf.netperf.netp
        0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470  erf.netperf.netp
        ...

This also leads to awful throughput as GSO packets are dropped and cause retransmissions.

The solution is to set the mac_len using the values already available in the new skb. We've already adjusted all of the header offsets, so we might as well correctly figure out the mac_len using skb_reset_mac_len(). After this change, packets are segmented correctly and performance is restored.

CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
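For illustration, assuming the fix behaves as described: skb_reset_mac_len() recomputes mac_len from the header offsets already set on the new segment instead of copying the possibly stale value from the original skb; the wrapper below is hypothetical:

  #include <linux/skbuff.h>

  /* Hypothetical wrapper: skb_reset_mac_len() sets
   * nskb->mac_len = nskb->network_header - nskb->mac_header. */
  static void set_segment_mac_len(struct sk_buff *nskb)
  {
          skb_reset_mac_len(nskb);
  }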
2014-07-31  netlink: Use PAGE_ALIGNED macro  (Tobias Klauser)
Use PAGE_ALIGNED(...) instead of IS_ALIGNED(..., PAGE_SIZE). Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
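A minimal before/after of the macro swap; the helper and pointer name are illustrative:

  #include <linux/mm.h>

  static bool ring_block_aligned(void *block)
  {
          /* Before: IS_ALIGNED((unsigned long)block, PAGE_SIZE) */
          return PAGE_ALIGNED(block);     /* same check, stated directly */
  }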
2014-07-31  net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS  (Duan Jiong)
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver() and icmpv6_notify() do some validity checks on packet's length, but then some protocols check packet's length redundantly. So remove those duplicated statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in function icmp_socket_deliver() and icmpv6_notify() respectively. In addition, add missed counter in udp6/udplite6 when socket is NULL. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31  sctp: Fixup v4mapped behaviour to comply with Sock API  (Jason Gunthorpe)
The SCTP socket extensions API document describes the v4mapping option as follows:

  8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)

  This socket option is a Boolean flag which turns on or off the mapping of IPv4 addresses. If this option is turned on, then IPv4 addresses will be mapped to V6 representation. If this option is turned off, then no mapping will be done of V4 addresses and a user will receive both PF_INET6 and PF_INET type addresses on the socket. See [RFC3542] for more details on mapped V6 addresses.

This description isn't really in line with what the code does though.

Introduce addr_to_user (renamed addr_v4map), which should be called before any sockaddr is passed back to user space. The new function places the sockaddr into the correct format depending on the SCTP_I_WANT_MAPPED_V4_ADDR option.

Audit all places that touched v4mapped and either sanely construct a v4 or v6 address then call addr_to_user, or drop the unnecessary v4mapped check entirely. Audit all places that call addr_to_user and verify they are on a syscall return path.

Add a custom getname that formats the address properly.

Several bugs are addressed:
 - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for addresses to user space
 - The addr_len returned from recvmsg was not correct when returning AF_INET on a v6 socket
 - flowlabel and scope_id were not zeroed when promoting a v4 to v6
 - Some syscalls like bind and connect behaved differently depending on v4mapped

Tested bind, getpeername, getsockname, connect, and recvmsg for proper behaviour in v4mapped = 1 and 0 cases.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  (David S. Miller)
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains netfilter updates for net-next, they are:

1) Add the reject expression for the nf_tables bridge family, this allows us to send explicit reject (TCP RST / ICMP dest unrech) to the packets matching a rule.

2) Simplify and consolidate the nf_tables set dumping logic. This uses netlink control->data to filter out depending on the request.

3) Perform garbage collection in xt_hashlimit using a workqueue instead of a timer, which is problematic when many entries are in place in the tables, from Eric Dumazet.

4) Remove leftover code from the removed ulog target support, from Paul Bolle.

5) Dump unmodified flags in the netfilter packet accounting when resetting counters, so userspace knows that a counter was in overquota situation, from Alexey Perevalov.

6) Fix wrong usage of the bitwise functions in nfnetlink_acct, also from Alexey.

7) Fix a crash when adding new set element with an empty NFTA_SET_ELEM_LIST attribute.

This patchset also includes a couple of cleanups for xt_LED from Duan Jiong and for nf_conntrack_ipv4 (using coccinelle) from Himangi Saraogi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31  tcp: don't require root to read tcp_metrics  (Banerjee, Debabrata)
commit d23ff7016 (tcp: add generic netlink support for tcp_metrics) introduced netlink support for the new tcp_metrics, however it restricted getting of tcp_metrics to root user only. This is a change from how these values could have been fetched when in the old route cache. Unless there's a legitimate reason to restrict the reading of these values it would be better if normal users could fetch them. Cc: Julian Anastasov <ja@ssi.bg> Cc: linux-kernel@vger.kernel.org Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31  xprtrdma: Handle additional connection events  (Chuck Lever)
Commit 38ca83a5 added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status is relevant only for consumers that re-use their QPs on new connections. xprtrdma creates a fresh QP on reconnection, so that event should be explicitly ignored. Squelch the alarming "unexpected CM event" message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro  (Chuck Lever)
Clean up. RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode. Since RPCRDMA_REGISTER has been removed, there's no need for the extra conditional compilation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Make rpcrdma_ep_disconnect() return void  (Chuck Lever)
Clean up: The return code is used only for dprintk's that are already redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Schedule reply tasklet once per upcall  (Chuck Lever)
Minor optimization: grab rpcrdma_tk_lock_g and disable hard IRQs just once after clearing the receive completion queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Allocate each struct rpcrdma_mw separately  (Chuck Lever)
Currently rpcrdma_buffer_create() allocates struct rpcrdma_mw's as a single contiguous area of memory. It amounts to quite a bit of memory, and there's no requirement for these to be carved from a single piece of contiguous memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Rename frmr_wr  (Chuck Lever)
Clean up: Name frmr_wr after the opcode of the Work Request, consistent with the send and local invalidation paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Disable completions for LOCAL_INV Work Requests  (Chuck Lever)
Instead of relying on a completion to change the state of an FRMR to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Disable completions for FAST_REG_MR Work Requests  (Chuck Lever)
Instead of relying on a completion to change the state of an FRMR to FRMR_IS_VALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()  (Chuck Lever)
Any FRMR arriving in rpcrdma_register_frmr_external() is now guaranteed to be either invalid, or to be targeted by a queued LOCAL_INV that will invalidate it before the adapter processes the FAST_REG_MR being built here. The problem with current arrangement of chaining a LOCAL_INV to the FAST_REG_MR is that if the transport is not connected, the LOCAL_INV is flushed and the FAST_REG_MR is flushed. This leaves the FRMR valid with the old rkey. But rpcrdma_register_frmr_external() has already bumped the in-memory rkey. Next time through rpcrdma_register_frmr_external(), a LOCAL_INV and FAST_REG_MR is attempted again because the FRMR is still valid. But the rkey no longer matches the hardware's rkey, and a memory management operation error occurs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request  (Chuck Lever)
When a LOCAL_INV Work Request is flushed, it leaves an FRMR in the VALID state. This FRMR can be returned by rpcrdma_buffer_get(), and must be knocked down in rpcrdma_register_frmr_external() before it can be re-used. Instead, capture these in rpcrdma_buffer_get(), and reset them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31  xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect  (Chuck Lever)
FAST_REG_MR Work Requests update a Memory Region's rkey. Rkeys are used to block unwanted access to the memory controlled by an MR. The rkey is passed to the receiver (the NFS server, in our case), and is also used by xprtrdma to invalidate the MR when the RPC is complete. When a FAST_REG_MR Work Request is flushed after a transport disconnect, xprtrdma cannot tell whether the WR actually hit the adapter or not. So it is indeterminate at that point whether the existing rkey is still valid. After the transport connection is re-established, the next FAST_REG_MR or LOCAL_INV Work Request against that MR can sometimes fail because the rkey value does not match what xprtrdma expects. The only reliable way to recover in this case is to deregister and register the MR before it is used again. These operations can be done only in a process context, so handle it in the transport connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>