path: root/include
Age  Commit message  (Author)
2012-09-13  scsi_netlink: Remove dead and buggy code  (Eric W. Biederman)
The scsi netlink code confuses the netlink port id with a process id, going so far as to read NETLINK_CREDS(skb)->pid instead of the correct NETLINK_CB(skb).pid. Fortunately it does not matter because nothing registers to respond to scsi netlink requests. The only interesting use of the scsi_netlink interface is fc_host_post_vendor_event which sends a netlink multicast message. Since nothing registers to handle scsi netlink messages kill all of the registration logic, while retaining the same error handling behavior preserving the userspace visible behavior and removing all of the confused code that thought a netlink port id was a process id. This was tested with a kernel allyesconfig build which had no problems. Cc: James Bottomley <James.Bottomley@parallels.com> Cc: James Smart <James.Smart@Emulex.Com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10  etherdevice: introduce help function eth_zero_addr()  (Duan Jiong)
A lot of code uses either memset() or an inefficient copy from a static array that contains the all-zeros Ethernet address. Introduce the helper function eth_zero_addr() to fill an address with all zeros, making the code clearer and allowing us to get rid of some constant arrays. Signed-off-by: Duan Jiong <djduanjiong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
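The idea behind the helper can be sketched in user-space C. This is a minimal illustration, not the kernel header itself: `eth_zero_addr_sketch()` and `addr_is_all_zero()` are hypothetical names, and `ETH_ALEN` is assumed to be 6 bytes.

```c
#include <stdint.h>
#include <string.h>

#ifndef ETH_ALEN
#define ETH_ALEN 6 /* assumed length of an Ethernet address in bytes */
#endif

/* Hypothetical stand-in for the helper: clear all bytes of the address. */
static void eth_zero_addr_sketch(uint8_t *addr)
{
	memset(addr, 0x00, ETH_ALEN);
}

/* Hypothetical check used only for illustration: 1 if every byte is zero. */
static int addr_is_all_zero(const uint8_t *addr)
{
	int i;

	for (i = 0; i < ETH_ALEN; i++)
		if (addr[i] != 0)
			return 0;
	return 1;
}
```

A caller that previously copied from a static all-zeros array would call `eth_zero_addr_sketch(mac)` instead, so the constant array is no longer needed.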
2012-09-10  filter: add MOD operation  (Eric Dumazet)
Add a new ALU opcode, to compute a modulus. Commit ffe06c17afbbb used an ancillary to implement XOR_X, but here we reserve one of the available ALU opcode to implement both MOD_X and MOD_K Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: George Bakos <gbakos@alpinista.org> Cc: Jay Schulist <jschlst@samba.org> Cc: Jiri Pirko <jpirko@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
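A modulus step in a filter ALU might look like the following sketch. It is a user-space illustration only: `alu_mod()` is a hypothetical name, and rejecting a zero divisor with an error return is an assumption of this sketch, not a statement of how the kernel filter code reports it.

```c
#include <stdint.h>

/*
 * Hypothetical ALU step: A = A % operand.
 * The operand is either an immediate (the MOD_K case) or the value of
 * the index register (the MOD_X case); the arithmetic is the same.
 * Returns 0 on success, -1 if the operand is zero (sketch assumption),
 * in which case the accumulator is left untouched.
 */
static int alu_mod(uint32_t *a, uint32_t operand)
{
	if (operand == 0)
		return -1;
	*a %= operand;
	return 0;
}
```

With an accumulator of 17, `alu_mod(&a, 5)` leaves 2 in the accumulator.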
2012-09-10  netlink: Rename pid to portid to avoid confusion  (Eric W. Biederman)
It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08  netlink: hide struct module parameter in netlink_kernel_create  (Pablo Neira Ayuso)
This patch defines netlink_kernel_create as a wrapper function of __netlink_kernel_create to hide the struct module *me parameter (which seems to be THIS_MODULE in all existing netlink subsystems). Suggested by David S. Miller. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08  netlink: kill netlink_set_nonroot  (Pablo Neira Ayuso)
Replace netlink_set_nonroot by one new field `flags' in struct netlink_kernel_cfg that is passed to netlink_kernel_create. This patch also renames NL_NONROOT_* to NL_CFG_F_NONROOT_* since now the flags field in nl_table is generic (so we can add more flags if needed in the future). Also adjust all callers in the net-next tree to use these flags instead of netlink_set_nonroot. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07  ipv4/route: arg delay is useless in rt_cache_flush()  (Nicolas Dichtel)
Since the route cache deletion (89aef8921bfbac22f), the delay argument is no longer used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07  scm: Don't use struct ucred in NETLINK_CB and struct scm_cookie.  (Eric W. Biederman)
Passing uids and gids on NETLINK_CB from a process in one user namespace to a process in another user namespace can result in the wrong uid or gid being presented to userspace. Avoid that problem by passing kuids and kgids instead.
- Define struct scm_creds for use in scm_cookie and netlink_skb_parms that holds uid and gid information in kuid_t and kgid_t.
- Modify scm_set_cred to fill out scm_creds by hand instead of using cred_to_ucred to fill out struct ucred. This conversion ensures userspace does not get incorrect uid or gid values to look at.
- Modify scm_recv to convert from struct scm_creds to struct ucred before copying credential values to userspace.
- Modify __scm_send to populate struct scm_creds in the scm_cookie, instead of just copying struct ucred from userspace.
- Modify netlink_sendmsg to copy scm_creds instead of struct ucred into the NETLINK_CB.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05  ipv6: fix handling of blackhole and prohibit routes  (Nicolas Dichtel)
When adding a blackhole or a prohibit route, it was handled like a classic route. Moreover, it was only possible to add this kind of route by specifying an interface. Bug already reported here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498

Before the patch:
  $ ip route add blackhole 2001::1/128
  RTNETLINK answers: No such device
  $ ip route add blackhole 2001::1/128 dev eth0
  $ ip -6 route | grep 2001
  2001::1 dev eth0 metric 1024

After:
  $ ip route add blackhole 2001::1/128
  $ ip -6 route | grep 2001
  blackhole 2001::1 dev lo metric 1024 error -22

v2: wrong patch
v3: add a field fc_type in struct fib6_config to store RTN_* type
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05  net: qdisc busylock needs lockdep annotations  (Eric Dumazet)
It seems we need to provide ability for stacked devices to use specific lock_class_key for sch->busylock We could instead default l2tpeth tx_queue_len to 0 (no qdisc), but a user might use a qdisc anyway. (So same fixes are probably needed on non LLTX stacked drivers) Noticed while stressing L2TPV3 setup : ====================================================== [ INFO: possible circular locking dependency detected ] 3.6.0-rc3+ #788 Not tainted ------------------------------------------------------- netperf/4660 is trying to acquire lock: (l2tpsock){+.-...}, at: [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] but task is already holding lock: (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&sch->busylock)->rlock){+.-...}: [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffff817499fc>] _raw_spin_lock_irqsave+0x4c/0x60 [<ffffffff81074872>] __wake_up+0x32/0x70 [<ffffffff8136d39e>] tty_wakeup+0x3e/0x80 [<ffffffff81378fb3>] pty_write+0x73/0x80 [<ffffffff8136cb4c>] tty_put_char+0x3c/0x40 [<ffffffff813722b2>] process_echoes+0x142/0x330 [<ffffffff813742ab>] n_tty_receive_buf+0x8fb/0x1230 [<ffffffff813777b2>] flush_to_ldisc+0x142/0x1c0 [<ffffffff81062818>] process_one_work+0x198/0x760 [<ffffffff81063236>] worker_thread+0x186/0x4b0 [<ffffffff810694d3>] kthread+0x93/0xa0 [<ffffffff81753e24>] kernel_thread_helper+0x4/0x10 -> #0 (l2tpsock){+.-...}: [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10 [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50 [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth] [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70 [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290 [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00 [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890 [<ffffffff815db019>] 
ip_output+0x59/0xf0 [<ffffffff815da36d>] ip_local_out+0x2d/0xa0 [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680 [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60 [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30 [<ffffffff815f5300>] tcp_push_one+0x30/0x40 [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040 [<ffffffff81614495>] inet_sendmsg+0x125/0x230 [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0 [<ffffffff81579ece>] sys_sendto+0xfe/0x130 [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&sch->busylock)->rlock); lock(l2tpsock); lock(&(&sch->busylock)->rlock); lock(l2tpsock); *** DEADLOCK *** 5 locks held by netperf/4660: #0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e581c>] tcp_sendmsg+0x2c/0x1040 #1: (rcu_read_lock){.+.+..}, at: [<ffffffff815da3e0>] ip_queue_xmit+0x0/0x680 #2: (rcu_read_lock_bh){.+....}, at: [<ffffffff815d9ac5>] ip_finish_output+0x135/0x890 #3: (rcu_read_lock_bh){.+....}, at: [<ffffffff81595820>] dev_queue_xmit+0x0/0xe00 #4: (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00 stack backtrace: Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788 Call Trace: [<ffffffff8173dbf8>] print_circular_bug+0x1fb/0x20c [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10 [<ffffffff810a334b>] ? check_usage+0x9b/0x4d0 [<ffffffff810a3f44>] ? __lock_acquire+0x2e4/0x1b10 [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50 [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth] [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70 [<ffffffff81594e0e>] ? dev_hard_start_xmit+0x5e/0xa70 [<ffffffff81595961>] ? dev_queue_xmit+0x141/0xe00 [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290 [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00 [<ffffffff81595820>] ? 
dev_hard_start_xmit+0xa70/0xa70 [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890 [<ffffffff815d9ac5>] ? ip_finish_output+0x135/0x890 [<ffffffff815db019>] ip_output+0x59/0xf0 [<ffffffff815da36d>] ip_local_out+0x2d/0xa0 [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680 [<ffffffff815da3e0>] ? ip_local_out+0xa0/0xa0 [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60 [<ffffffff815fa25e>] ? tcp_md5_do_lookup+0x18e/0x1a0 [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30 [<ffffffff815f5300>] tcp_push_one+0x30/0x40 [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040 [<ffffffff81614495>] inet_sendmsg+0x125/0x230 [<ffffffff81614370>] ? inet_create+0x6b0/0x6b0 [<ffffffff8157e6e2>] ? sock_update_classid+0xc2/0x3b0 [<ffffffff8157e750>] ? sock_update_classid+0x130/0x3b0 [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0 [<ffffffff81162579>] ? fget_light+0x3f9/0x4f0 [<ffffffff81579ece>] sys_sendto+0xfe/0x130 [<ffffffff810a69ad>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8174a0b0>] ? _raw_spin_unlock_irq+0x30/0x50 [<ffffffff810757e3>] ? finish_task_switch+0x83/0xf0 [<ffffffff810757a6>] ? finish_task_switch+0x46/0xf0 [<ffffffff81752cb7>] ? sysret_check+0x1b/0x56 [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05  tcp: add generic netlink support for tcp_metrics  (Julian Anastasov)
Add support for genl "tcp_metrics". No locking is changed, only that now we can unlink and delete entries after grace period. We implement get/del for single entry and dump to support show/flush filtering in user space. Del without address attribute causes flush for all addresses, sadly under genl_mutex. v2: - remove rcu_assign_pointer as suggested by Eric Dumazet, it is not needed because there are no other writes under lock - move the flushing code in tcp_metrics_flush_all v3: - remove synchronize_rcu on flush as suggested by Eric Dumazet Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-03  Merge branch 'master' of git://1984.lsi.us.es/nf-next  (David S. Miller)
2012-09-03  tcp: use PRR to reduce cwin in CWR state  (Yuchung Cheng)
Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state, in addition to Recovery state. Retire the current rate-halving in CWR. When losses are detected via ACKs in CWR state, the sender enters Recovery state but the cwnd reduction continues and does not restart. Rename and refactor cwnd reduction functions since both CWR and Recovery use the same algorithm: tcp_init_cwnd_reduction() is new and initiates reduction state variables. tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery(). tcp_ends_cwnd_reduction() is previously tcp_complete_cwr(). The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(), and the cwnd moderation inside tcp_enter_cwr() are removed. The unused parameter, flag, in tcp_cwnd_reduction() is also removed. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
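The proportional part of PRR can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel's tcp_cwnd_reduction(): it assumes the formula sndcnt = ceil(prr_delivered * ssthresh / prior_cwnd) - prr_out from the PRR specification (RFC 6937), with `prior_cwnd` standing in for the flight size at the start of the reduction. The function and parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many segments may be sent on this ACK while
 * the amount of data in flight is still above ssthresh.
 *
 * prr_delivered: segments delivered to the receiver since the reduction began
 * prr_out:       segments sent since the reduction began
 * ssthresh:      target window at the end of the reduction
 * prior_cwnd:    window when the reduction began (must be non-zero)
 */
static int prr_sndcnt_sketch(uint32_t prr_delivered, uint32_t prr_out,
			     uint32_t ssthresh, uint32_t prior_cwnd)
{
	/* Round the division up: ceil(a / b) == (a + b - 1) / b. */
	uint64_t dividend = (uint64_t)ssthresh * prr_delivered + prior_cwnd - 1;

	return (int)(dividend / prior_cwnd) - (int)prr_out;
}
```

For example, with a prior window of 20 segments and an ssthresh of 10, roughly one new segment is released for every two delivered, so the window shrinks gradually instead of in one step.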
2012-09-03  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Pablo Neira Ayuso)
This merges (3f509c6 netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.
2012-09-03  netfilter: nf_conntrack: add nf_ct_timeout_lookup  (Pablo Neira Ayuso)
This patch adds the new nf_ct_timeout_lookup function to encapsulate the timeout policy attachment that is called in the nf_conntrack_in path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-31  tcp: TCP Fast Open Server - support TFO listeners  (Jerry Chu)
This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31  tcp: TCP Fast Open Server - header & support functions  (Jerry Chu)
This patch adds all the necessary data structure and support functions to implement TFO server side. It also documents a number of flags for the sysctl_tcp_fastopen knob, and adds a few Linux extension MIBs. In addition, it includes the following: 1. a new TCP_FASTOPEN socket option an application must call to supply a max backlog allowed in order to enable TFO on its listener. 2. A number of key data structures: "fastopen_rsk" in tcp_sock - for a big socket to access its request_sock for retransmission and ack processing purpose. It is non-NULL iff 3WHS not completed. "fastopenq" in request_sock_queue - points to a per Fast Open listener data structure "fastopen_queue" to keep track of qlen (# of outstanding Fast Open requests) and max_qlen, among other things. "listener" in tcp_request_sock - to point to the original listener for book-keeping purpose, i.e., to maintain qlen against max_qlen as part of defense against IP spoofing attack. 3. various data structure and functions, many in tcp_fastopen.c, to support server side Fast Open cookie operations, including /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31  net:stmmac: Remove bus_id from mdio platform data.  (Srinivas Kandagatla)
This patch removes bus_id from the mdio platform data. The stmmac mdio bus_id is always the same as the stmmac bus_id, so there is no point in passing it in a separate variable. Also, the stmmac ethernet driver connects to the phy with the bus_id passed in its platform data, so having a single bus_id is much simpler. Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31  tcp: Increase timeout for SYN segments  (Alex Bergmann)
Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") changed the initRTO from 3secs to 1sec in accordance to RFC6298 (former RFC2988bis). This reduced the time till the last SYN retransmission packet gets sent from 93secs to 31secs. RFC1122 is stating that the retransmission should be done for at least 3 minutes, but this seems to be quite high. "However, the values of R1 and R2 may be different for SYN and data segments. In particular, R2 for a SYN segment MUST be set large enough to provide retransmission of the segment for at least 3 minutes. The application can close the connection (i.e., give up on the open attempt) sooner, of course." This patch increases the value of TCP_SYN_RETRIES to the value of 6, providing a retransmission window of 63secs. The comments for SYN and SYNACK retries have also been updated to describe the current settings. The same goes for the documentation file "Documentation/networking/ip-sysctl.txt". Signed-off-by: Alexander Bergmann <alex@linlab.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
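The timing figures quoted above follow from exponential backoff and can be checked with a short calculation. The sketch below assumes the retransmission timeout simply doubles after every retransmission and ignores any upper bound on the timeout; `last_syn_retransmit_secs()` is a hypothetical name.

```c
/*
 * Seconds from the initial SYN until the last retransmission is sent,
 * assuming the timeout starts at init_rto_secs and doubles each time.
 */
static unsigned int last_syn_retransmit_secs(unsigned int init_rto_secs,
					     unsigned int retries)
{
	unsigned int elapsed = 0;
	unsigned int rto = init_rto_secs;
	unsigned int i;

	for (i = 0; i < retries; i++) {
		elapsed += rto; /* wait one timeout, then retransmit */
		rto *= 2;       /* exponential backoff */
	}
	return elapsed;
}
```

With an initial timeout of 3 seconds and 5 retries this gives 3+6+12+24+48 = 93 seconds; with 1 second and 5 retries, 1+2+4+8+16 = 31 seconds; and with 1 second and 6 retries, 63 seconds, matching the values in the commit message.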
2012-08-31  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Merge the 'net' tree to get the recent set of netfilter bug fixes in order to assist with some merge hassles Pablo is going to have to deal with for upcoming changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31  Merge branch 'master' of git://1984.lsi.us.es/nf  (David S. Miller)
2012-08-31  netfilter: nf_conntrack: fix racy timer handling with reliable events  (Pablo Neira Ayuso)
Existing code assumes that del_timer returns true for alive conntrack entries. However, this is not true if reliable events are enabled. In that case, del_timer may return true for entries that were just inserted in the dying list. Note that packets / ctnetlink may hold references to conntrack entries that were just inserted to such list. This patch fixes the issue by adding an independent timer for event delivery. This increases the size of the ecache extension. Still we can revisit this later and use variable size extensions to allocate this area on demand. Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-30  bnx2x: fix 57840_MF pci id  (Yuval Mintz)
Commit c3def943c7117d42caaed3478731ea7c3c87190e added support for new pci ids of the 57840 board, but failed to change the obsolete value in 'pci_ids.h'. This patch does so, allowing the probe of such devices. Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30  of/mdio: Add dummy functions in of_mdio.h.  (Srinivas Kandagatla)
This patch adds dummy functions in of_mdio.h, so that drivers need not wrap their code in CONFIG_OF ifdefs. Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
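The stub pattern described here can be sketched as follows. All names are hypothetical (`SKETCH_CONFIG_OF`, `sketch_of_bus_register()`, `sketch_driver_probe()`), and returning -ENOSYS from the stub is an assumption of this sketch. The point is that the header supplies an inline fallback, so the caller compiles unchanged whether or not the option is enabled.

```c
#include <errno.h>

struct sketch_bus { int id; }; /* hypothetical bus handle */

#ifdef SKETCH_CONFIG_OF
/* Real implementation lives elsewhere when the option is enabled. */
int sketch_of_bus_register(struct sketch_bus *bus);
#else
/* Dummy fallback: same signature, reports "not supported". */
static inline int sketch_of_bus_register(struct sketch_bus *bus)
{
	(void)bus;
	return -ENOSYS;
}
#endif

/* Driver code calls the function unconditionally, with no #ifdef. */
static int sketch_driver_probe(struct sketch_bus *bus)
{
	int rc = sketch_of_bus_register(bus);

	if (rc == -ENOSYS)
		return 0; /* no device-tree support: carry on without it */
	return rc;
}
```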
2012-08-30  netfilter: ip6tables: add stateless IPv6-to-IPv6 Network Prefix Translation target  (Patrick McHardy)
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  netfilter: nf_nat: support IPv6 in SIP NAT helper  (Patrick McHardy)
Add IPv6 support to the SIP NAT helper. There are no functional differences to IPv4 NAT, just different formats for addresses. Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  netfilter: ip6tables: add MASQUERADE target  (Patrick McHardy)
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  netfilter: ipv6: add IPv6 NAT support  (Patrick McHardy)
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  net: core: add function for incremental IPv6 pseudo header checksum updates  (Patrick McHardy)
Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header checksums for IPv6 NAT. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net>
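An incremental checksum update of this kind can be sketched in user-space C. This is not the kernel function: it is a minimal illustration assuming the update rule HC' = ~(~HC + ~m + m') from RFC 1624, applied to the eight 16-bit words of an IPv6 address. The names `csum_fold32()`, `csum_full()` and `csum_replace16_sketch()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full checksum: complement of the one's-complement sum of all words. */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~csum_fold32(sum);
}

/*
 * Incremental update after a 16-byte field changes from 'from' to 'to':
 * add the complement of each old word and each new word to the old sum.
 */
static uint16_t csum_replace16_sketch(uint16_t check,
				      const uint16_t from[8],
				      const uint16_t to[8])
{
	uint32_t sum = (uint16_t)~check;
	int i;

	for (i = 0; i < 8; i++) {
		sum += (uint16_t)~from[i];
		sum += to[i];
	}
	return (uint16_t)~csum_fold32(sum);
}
```

The incremental result should equal a full recomputation over the modified data, which is what makes it usable for NAT: only the rewritten address has to be touched, not the whole packet.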
2012-08-30  netfilter: add protocol independent NAT core  (Patrick McHardy)
Convert the IPv4 NAT implementation to a protocol independent core and address family specific modules. Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  netfilter: nf_nat: add protoff argument to packet mangling functions  (Patrick McHardy)
For mangling IPv6 packets the protocol header offset needs to be known by the NAT packet mangling functions. Add a so far unused protoff argument and convert the conntrack and NAT helpers to use it in preparation of IPv6 NAT. Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30  netfilter: nf_conntrack_ipv6: improve fragmentation handling  (Patrick McHardy)
The IPv6 conntrack fragmentation currently has a couple of shortcomings. Fragments are collected in PREROUTING/OUTPUT, are defragmented, the defragmented packet is then passed to conntrack, the resulting conntrack information is attached to each original fragment and the fragments then continue their way through the stack. Helper invocation occurs in the POSTROUTING hook, at which point only the original fragments are available. The result of this is that fragmented packets are never passed to helpers.

This patch improves the situation in the following way:
- If a reassembled packet belongs to a connection that has a helper assigned, the reassembled packet is passed through the stack instead of the original fragments.
- During defragmentation, the largest received fragment size is stored. On output, the packet is refragmented if required. If the largest received fragment size exceeds the outgoing MTU, a "packet too big" message is generated, thus behaving as if the original fragments were passed through the stack from an outside point of view.
- The ipv6_helper() hook function can't receive fragments anymore for connections using a helper, so it is switched to use ipv6_skip_exthdr() instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the reassembled packets are passed to connection tracking helpers.

The result of this is that we can properly track fragmented packets, but still generate ICMPv6 Packet too big messages if we would have before.

This patch is also required as a precondition for IPv6 NAT, where NAT helpers might enlarge packets up to a point that they require fragmentation. In that case we can't generate Packet too big messages since the proper MTU can't be calculated in all cases (for instance, when changing the textual representation of a variable number of addresses), so the packet is transparently fragmented iff the original packet or fragments would have fit the outgoing MTU.

IPVS parts by Jesper Dangaard Brouer <brouer@redhat.com>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-26  ipv4: fix path MTU discovery with connection tracking  (Patrick McHardy)
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and (in case of forwarded packets) refragments them at POST_ROUTING independent of the IP_DF flag. Refragmentation uses the dst_mtu() of the local route without caring about the original fragment sizes, thereby breaking PMTUD. This patch fixes this by keeping track of the largest received fragment with IP_DF set and generates an ICMP fragmentation required error during refragmentation if that size exceeds the MTU. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net>
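The bookkeeping described above can be sketched in user-space C. This is a simplified model, not the kernel code: `struct frag_track`, `frag_track_note()` and `frag_needed_error()` are hypothetical names, and the sketch only models the decision of whether an ICMP "fragmentation required" error should be generated.

```c
/* Hypothetical per-packet state kept while fragments are reassembled. */
struct frag_track {
	unsigned int max_df_size; /* largest fragment seen with DF set */
};

/* Record one received fragment; only fragments with DF set are tracked. */
static void frag_track_note(struct frag_track *t, unsigned int len, int df_set)
{
	if (df_set && len > t->max_df_size)
		t->max_df_size = len;
}

/*
 * At refragmentation time: return 1 if an error should be sent back
 * because a DF fragment was larger than the outgoing MTU, else 0.
 */
static int frag_needed_error(const struct frag_track *t, unsigned int mtu)
{
	return t->max_df_size > mtu;
}
```

If no fragment carried DF, `max_df_size` stays at zero and the packet is simply refragmented to the outgoing MTU.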
2012-08-24  Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace  (David S. Miller)
This is an initial merge in of Eric Biederman's work to start adding user namespace support to the networking. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24  Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next  (David S. Miller)
John W. Linville says: ==================== This is a batch of updates intended for 3.7. The bulk of it is mac80211 changes, including some mesh work from Thomas Pederson and some multi-channel work from Johannes. A variety of driver updates and other bits are scattered in there as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24  vlan: add helper which can be called to see if device is used by vlan  (Jiri Pirko)
Also remove the unused vlan_info definition from the header. CC: Patrick McHardy <kaber@trash.net> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24  net: Set device operstate at registration time  (Ben Hutchings)
The operstate of a device is initially IF_OPER_UNKNOWN and is updated asynchronously by linkwatch after each change of carrier state reported by the driver. The default carrier state of a net device is on, and this will never be changed on drivers that do not support carrier detection, thus the operstate remains IF_OPER_UNKNOWN.

For devices that do support carrier detection, the driver must set the carrier state to off initially, then poll the hardware state when the device is opened. However, we must not activate linkwatch for an unregistered device, and commit b473001 ('net: Do not fire linkwatch events until the device is registered.') ensured that we don't. But this means that the operstate for many devices that support carrier detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN. The same issue exists with the dormant state.

The proper initialisation sequence, avoiding a race with opening of the device, is:

    rtnl_lock();
    rc = register_netdevice(dev);
    if (rc)
        goto out_unlock;
    netif_carrier_off(dev); /* or netif_dormant_on(dev) */
    rtnl_unlock();

but it seems silly that this should have to be repeated in so many drivers. Further, the operstate seen immediately after opening the device may still be IF_OPER_UNKNOWN due to the asynchronous nature of linkwatch.

Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted to fix this by setting the operstate synchronously, but it was reverted as it could lead to deadlock. This initialises the operstate synchronously at registration time only.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem  (John W. Linville)
2012-08-23  packet: fix broken build.  (Rami Rosen)
This patch fixes a broken build due to a missing header:
  ...
  CC net/ipv4/proc.o
  In file included from include/net/net_namespace.h:15,
                   from net/ipv4/proc.c:35:
  include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type
  ...
The lock of netns_packet was changed by a recent patch from a spinlock to a mutex, so the included header must change from linux/spinlock.h to linux/mutex.h as well. See commit 0fa7fa98dbcc2789409ed24e885485e645803d7f ("packet: Protect packet sk list with mutex (v2)").
Signed-off-by: Rami Rosen <rosenr@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23  Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next  (John W. Linville)
2012-08-22  packet: Protect packet sk list with mutex (v2)  (Pavel Emelyanov)
Change since v1:
* Fixed inuse counters access spotted by Eric

In patch eea68e2f (packet: Report socket mclist info via diag module) I've introduced a "scheduling in atomic" problem in the packet diag module -- the socket list is traversed under rcu_read_lock(), while the sk mclist access performed under it requires the rtnl lock (i.e. a mutex) to be taken.

  [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
  [152363.820573] 4 locks held by crtools/12517:
  [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
  [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
  [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
  [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af

A similar thing was then re-introduced by further packet diag patches (fanout mutex and pgvec mutex for rings) :(

Apart from being terribly sorry for the above, I propose to change the packet sk list protection from spinlock to mutex. This lock currently protects two modifications:
* sklist
* prot inuse counters

The sklist modifications can be just reprotected with mutex since they already occur in a sleeping context. The inuse counters modifications are trickier -- the __this_cpu_-s are used inside, thus requiring the caller to handle the potential issues with contexts himself. Since packet sockets' counters are modified in two places only (packet_create and packet_release) we only need to protect the context from being preempted. BH disabling is not required in this case.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22  mdio: translation of MMD EEE registers to/from ethtool settings  (Allan, Bruce W)
The helper functions which translate IEEE MDIO Manageable Device (MMD) Energy-Efficient Ethernet (EEE) registers 3.20, 7.60 and 7.61 to and from the comparable ethtool supported/advertised settings will be needed by drivers other than those in PHYLIB (e.g. e1000e in a follow-on patch). In the same fashion as similar translation functions in linux/mii.h, move these functions from the PHYLIB core to the linux/mdio.h header file so the code will not have to be duplicated in each driver needing MMD-to-ethtool (and vice-versa) translations. The function and some variable names have been renamed to be more descriptive. Not tested on the only hardware that currently calls the related functions, stmmac, because I don't have access to any. Has been compile tested and the translations have been tested on a locally modified version of e1000e. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22  net: remove delay at device dismantle  (Eric Dumazet)
I noticed an extra one-second delay in device dismantle, tracked down to a call to dst_dev_event() while some call_rcu() are still in RCU queues. These call_rcu() were posted by rt_free(struct rtable *rt) calls. We then wait (for one second) in netdev_wait_allrefs() before kicking NETDEV_UNREGISTER again. As the call_rcu() are now completed, dst_dev_event() can do the needed device swap on busy dst. To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called after a rcu_barrier(), but outside of RTNL lock. Use NETDEV_UNREGISTER_FINAL with care! Change the dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL. Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after IP cache removal. With help from Gao feng. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22Merge git://1984.lsi.us.es/nf-nextDavid S. Miller
Pablo Neira Ayuso says:

====================
This is the first batch of Netfilter and IPVS updates for your net-next tree. Mostly cleanups for the Netfilter side. They are:

* Remove unnecessary RTNL locking now that we have support for namespace in nf_conntrack, from Patrick McHardy.
* Cleanup to eliminate unnecessary goto in the initialization path of several Netfilter tables, from Jean Sacren.
* Another cleanup from Wu Fengguang, this time to use PTR_RET instead of open-coding "if IS_ERR then return PTR_ERR".
* Use list_for_each_entry_continue_rcu in nf_iterate, from Michael Wang.
* Add pmtu_disc sysctl option to disable PMTU discovery in the IPVS tunneling transmitter, from Julian Anastasov.
* Generalize application protocol registration in IPVS and modify IPVS FTP helper to use it, from Julian Anastasov.
* Update Kconfig: the IPVS FTP helper depends on the Netfilter FTP helper for NAT support, from Julian Anastasov.
* Add logic to update PMTU for IPIP packets in IPVS, again from Julian Anastasov.
* A couple of sparse warning fixes for IPVS and Netfilter from Claudiu Ghioc and Patrick McHardy respectively.

Patrick's IPv6 NAT changes will follow after this batch; I need to flush this batch first before refreshing my tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
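The PTR_RET cleanup mentioned above replaces an open-coded error check with a single helper call. The following userspace sketch imitates the pattern; it is not the kernel's linux/err.h. The `sketch_` names are hypothetical, and the error-pointer encoding (small negative errno values stored in the pointer itself) is reproduced here only as an assumption for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on errno values that can be encoded in a pointer. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer value falls in the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the negative errno value from an error pointer. */
static long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* The helper: error code if ptr encodes an error, 0 otherwise. */
static int sketch_ptr_ret(const void *ptr)
{
	return sketch_is_err(ptr) ? (int)sketch_ptr_err(ptr) : 0;
}

/* Before the cleanup: the check is spelled out at every call site. */
static int sketch_init_open_coded(const void *table)
{
	if (sketch_is_err(table))
		return (int)sketch_ptr_err(table);
	return 0;
}

/* After the cleanup: the same behaviour in one line. */
static int sketch_init_with_helper(const void *table)
{
	return sketch_ptr_ret(table);
}
```

Both init variants return the same value for any input, which is why this kind of change is a pure cleanup with no behavioural difference.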
2012-08-22Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-nextDavid S. Miller
Jeff Kirsher says:

====================
This series contains updates to ethtool.h, e1000, e1000e, and igb to implement MDI/MDIx control.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2012-08-22Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie:
 "Intel: edid fixes, power consumption fix, s/r fix, haswell fix

  Radeon: BIOS loading fixes for UEFI and Thunderbolt machines, better MSAA validation, lockup timeout fixes, modesetting fixes

  One udl dpms fix, one vmwgfx fix, a couple of trivial core changes.

  There is an export added to ACPI as part of the radeon bios fixes.

  I've also included the fbcon flashing cursor vs deinit race fix, that seems the simplest place to start"

Trivial conflict in drivers/video/console/fbcon.c due to me having already applied the fbcon flashing cursor vs deinit race fix, and Dave had added a comment in there too.

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (22 commits)
  fbcon: fix race condition between console lock and cursor timer (v1.1)
  drm: Add missing static storage class specifiers in drm_proc.c file
  drm/udl: dpms off the crtc when disabled.
  drm: Remove two unused fields from struct drm_display_mode
  drm: stop vmgfx driver explosion
  drm/radeon/ss: use num_crtc rather than hardcoded 6
  Revert "drm/radeon: fix bo creation retry path"
  drm/i915: use hsw rps tuning values everywhere on gen6+
  drm/radeon: split ATRM support out from the ATPX handler (v3)
  drm/radeon: convert radeon vfct code to use acpi_get_table_with_size
  ACPI: export symbol acpi_get_table_with_size
  drm/radeon: implement ACPI VFCT vbios fetch (v3)
  drm/radeon/kms: extend the Fujitsu D3003-S2 board connector quirk to cover later silicon stepping
  drm/radeon: fix checking of MSAA renderbuffers on r600-r700
  drm/radeon: allow CMASK and FMASK in the CS checker on r600-r700
  drm/radeon: init lockup timeout on ring init
  drm/radeon: avoid turning off spread spectrum for used pll
  drm/i915: fall back to bit-banging if GMBUS fails in CRT EDID reads
  drm/i915: extract connector update from intel_ddc_get_modes() for reuse
  drm/i915: fix hsw uncached pte
  ...
2012-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pendingLinus Torvalds
Pull SCSI target fixes from Nicholas Bellinger:
 "The executive summary includes:

  - Post-merge review comments for tcm_vhost (MST + nab)
  - Avoid debugging overhead when not debugging for tcm-fc(FCoE) (MDR)
  - Fix NULL pointer dereference bug on alloc_page failure (Yi Zou)
  - Fix REPORT_LUNs regression bug with pSCSI export (AlexE + nab)
  - Fix regression bug with handling of zero-length data CDBs (nab)
  - Fix vhost_scsi_target structure alignment (MST)

  Thanks again to everyone who contributed a bugfix patch, gave review feedback on tcm_vhost code, and/or reported a bug during their own testing over the last weeks.

  There is one other outstanding bug reported by Roland recently related to SCSI transfer length overflow handling, for which the current proposed bugfix has been left in queue pending further testing with other non iscsi-target based fabric drivers.

  As the patch is verified with loopback (local SGL memory from SCSI LLD) + tcm_qla2xxx (TCM allocated SGL memory mapped to PCI HW) fabric ports, it will be included into the next 3.6-rc-fixes PULL request."

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  target: Remove unused se_cmd.cmd_spdtl
  tcm_fc: rcu_deref outside rcu lock/unlock section
  tcm_vhost: Fix vhost_scsi_target structure alignment
  target: Fix regression bug with handling of zero-length data CDBs
  target/pscsi: Fix bug with REPORT_LUNs handling for SCSI passthrough
  tcm_vhost: Change vhost_scsi_target->vhost_wwpn to char *
  target: fix NULL pointer dereference bug alloc_page() fails to get memory
  tcm_fc: Avoid debug overhead when not debugging
  tcm_vhost: Post-merge review changes requested by MST
  tcm_vhost: Fix incorrect IS_ERR() usage in vhost_scsi_map_iov_to_sgl
2012-08-22Merge tag 'nfs-for-3.6-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust:
 - NFSv3 mounts need to fail if the FSINFO rpc call fails
 - Ensure that the NFS commit cache gets torn down when we unload the NFS module.
 - Fix memory scribble issues when interrupting a LAYOUTGET rpc call
 - Fix NFSv4 legacy idmapper regressions
 - Fix issues with the NFSv4 getacl command
 - Fix a regression when using the legacy "mount -t nfs4"

* tag 'nfs-for-3.6-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: Ensure that do_proc_get_root() reports errors correctly
  NFSv4: Ensure that nfs4_alloc_client cleans up on error.
  NFS: return -ENOKEY when the upcall fails to map the name
  NFS: Clear key construction data if the idmap upcall fails
  NFSv4: Don't use private xdr_stream fields in decode_getacl
  NFSv4: Fix the acl cache size calculation
  NFSv4: Fix pointer arithmetic in decode_getacl
  NFS: Alias the nfs module to nfs4
  NFS: Fix a regression when loading the NFS v4 module
  NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done
  pnfs-obj: Better IO pattern in case of unaligned offset
  NFS41: add pg_layout_private to nfs_pageio_descriptor
  pnfs: nfs4_proc_layoutget returns void
  pnfs: defer release of pages in layoutget
  nfs: tear down caches in nfs_init_writepagecache when allocation fails
2012-08-22Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull assorted fixes - mostly vfs - from Al Viro:
 "Assorted fixes, with an unexpected detour into vfio refcounting logics (fell out when digging in an analog of eventpoll race in there)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  task_work: add a scheduling point in task_work_run()
  fs: fix fs/namei.c kernel-doc warnings
  eventpoll: use-after-possible-free in epoll_create1()
  vfio: grab vfio_device reference *before* exposing the sucker via fd_install()
  vfio: get rid of vfio_device_put()/vfio_group_get_device* races
  vfio: get rid of open-coding kref_put_mutex
  introduce kref_put_mutex()
  vfio: don't dereference after kfree...
  mqueue: lift mnt_want_write() outside ->i_mutex, clean up a bit
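The refcounting pattern behind kref_put_mutex() can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel implementation: all `sketch_` names are hypothetical, C11 atomics and a pthread mutex stand in for the kernel primitives, and the convention that the release callback runs with the lock held and must drop it is an assumption stated here rather than something the log above spells out.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_ref {
	atomic_int count;
};

/* Decrement the count unless it is exactly 1; true if it was decremented. */
static bool sketch_dec_unless_one(struct sketch_ref *ref)
{
	int old = atomic_load(&ref->count);

	while (old != 1) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&ref->count, &old, old - 1))
			return true;
	}
	return false;
}

/*
 * Drop one reference. The lock is only taken when this might be the last
 * reference; if the count then reaches zero, release() is called with the
 * lock held and is expected to unlock it. Returns 1 if release() ran.
 */
static int sketch_ref_put_mutex(struct sketch_ref *ref,
				void (*release)(struct sketch_ref *),
				pthread_mutex_t *lock)
{
	if (sketch_dec_unless_one(ref))
		return 0;		/* fast path: other references remain */

	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(&ref->count, 1) != 1) {
		/* Someone took a new reference before we got the lock. */
		pthread_mutex_unlock(lock);
		return 0;
	}
	release(ref);
	return 1;
}

/* Example lock and release callback used to exercise the helper. */
static pthread_mutex_t sketch_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int sketch_released;

static void sketch_release(struct sketch_ref *ref)
{
	(void)ref;
	sketch_released++;	/* stand-in for unlinking from a locked list */
	pthread_mutex_unlock(&sketch_list_lock);
}
```

Taking the lock before the final decrement means a lookup that runs under the same lock cannot find the object after its count has already dropped to zero, which is the kind of race the vfio entries in the list above refer to.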