summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2015-02-03nfs41: allow LD to choose DS connection version/minor_versionPeng Tao
flexfile layout may need to set such when making DS connections. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfsv3: introduce nfs3_set_ds_clientPeng Tao
The flexfiles layout wants to create DS connection over NFSv3. Add nfs3_set_ds_client to allow that to happen. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs41: move file layout macros to generic pnfsPeng Tao
They can be reused by flexfile layout as well. Also add a code such that if read fails on one DS and there are other DSes available to use, don't resend through MDS but through pNFS so that client can read from other DSes. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs41: allow LD to choose DS connection auth flavorPeng Tao
flexfile layout may use different auth flavor as specified by MDS. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs41: pull nfs4_ds_connect from file layout to generic pnfsPeng Tao
It can be reused by flexfiles layout client. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs41: pull decode_ds_addr from file layout to generic pnfsPeng Tao
It can be reused by flexfile layout. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs41: pull data server cache from file layout to generic pnfsPeng Tao
Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs. They can all be reused by flexfile layout as well. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03pnfs: Do not grab the commit_info lock twice when rescheduling writesTom Haynes
Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03pnfs: Prepare for flexfiles by pulling out common codeTom Haynes
The flexfilelayout driver will share some common code with the filelayout driver. This set of changes refactors that common code out to avoid any module depenencies. Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03Merge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust
NFS: Client side changes for RDMA These patches improve the scalability of the NFSoRDMA client and take large variables off of the stack. Additionally, the GFP_* flags are updated to match what TCP uses. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (21 commits) xprtrdma: Update the GFP flags used in xprt_rdma_allocate() xprtrdma: Clean up after adding regbuf management xprtrdma: Allocate zero pad separately from rpcrdma_buffer xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req xprtrdma: Add struct rpcrdma_regbuf and helpers xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy() xprtrdma: Simplify synopsis of rpcrdma_buffer_create() xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack xprtrdma: Take struct ib_device_attr off the stack xprtrdma: Free the pd if ib_query_qp() fails xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt xprtrdma: Move credit update to RPC reply handler xprtrdma: Remove rl_mr field, and the mr_chunk union xprtrdma: Remove rpcrdma_ep::rep_ia xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt xprtrdma: Clean up hdrlen xprtrdma: Display XIDs in host byte order xprtrdma: Modernize htonl and ntohl ...
2015-01-30NFS: a couple off by onesDan Carpenter
These tests are off by one because if len == sizeof(nfs_export_path) then we have truncated the name. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30nfs: prevent truncate on active swapfileOmar Sandoval
Most filesystems prevent truncation of an active swapfile by way of inode_newsize_ok, called from inode_change_ok. NFS doesn't call either from nfs_setattr, presumably because most of these checks are expected to be done server-side. However, the IS_SWAPFILE check can only be done client-side, and truncating a swapfile can't possibly be good. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30nfs: don't call blocking operations while !TASK_RUNNINGJeff Layton
Bruce reported seeing this warning pop when mounting using v4.1: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90 Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78 0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000 Call Trace: [<ffffffff8186ac78>] dump_stack+0x4c/0x65 [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0 [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0 [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430 [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130 [<ffffffff810d941e>] groups_alloc+0x3e/0x130 [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc] [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc] [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc] [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc] [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4] [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0 [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4] [<ffffffff810d45bf>] kthread+0x11f/0x140 [<ffffffff810ea815>] ? local_clock+0x15/0x30 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 ---[ end trace 675220a11e30f4f2 ]--- nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE, which is just wrong. Fix that by finishing the wait immediately if we've found that the list has something on it. Also, we don't expect this kthread to accept signals, so we should be using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up hung task warnings from the watchdog, so have the schedule_timeout wake up every 60s if there's no callback activity. Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30xprtrdma: Update the GFP flags used in xprt_rdma_allocate()Chuck Lever
Reflect the more conservative approach used in the socket transport's version of this transport method. An RPC buffer allocation should avoid forcing not just FS activity, but any I/O. In particular, two recent changes missed updating xprtrdma: - Commit c6c8fe79a83e ("net, sunrpc: suppress allocation warning ...") - Commit a564b8f03986 ("nfs: enable swap on NFS") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Clean up after adding regbuf managementChuck Lever
rpcrdma_{de}register_internal() are used only in verbs.c now. MAX_RPCRDMAHDR is no longer used and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate zero pad separately from rpcrdma_bufferChuck Lever
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of contiguous memory needed for a buffer pool by moving the zero pad buffer into a regbuf. This is for consistency with the other uses of internally registered memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_repChuck Lever
The rr_base field is currently the buffer where RPC replies land. An RPC/RDMA reply header lands in this buffer. In some cases an RPC reply header also lands in this buffer, just after the RPC/RDMA header. The inline threshold is an agreed-on size limit for RDMA SEND operations that pass from server and client. The sum of the RPC/RDMA reply header size and the RPC reply header size must be less than this threshold. The largest RDMA RECV that the client should have to handle is the size of the inline threshold. The receive buffer should thus be the size of the inline threshold, and not related to RPCRDMA_MAX_SEGS. RPC replies received via RDMA WRITE (long replies) are caught in rq_rcv_buf, which is the second half of the RPC send buffer. Ie, such replies are not involved in any way with rr_base. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_reqChuck Lever
The rl_base field is currently the buffer where each RPC/RDMA call header is built. The inline threshold is an agreed-on size limit to for RDMA SEND operations that pass between client and server. The sum of the RPC/RDMA header size and the RPC header size must be less than or equal to this threshold. Increasing the r/wsize maximum will require MAX_SEGS to grow significantly, but the inline threshold size won't change (both sides agree on it). The server's inline threshold doesn't change. Since an RPC/RDMA header can never be larger than the inline threshold, make all RPC/RDMA header buffers the size of the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_reqChuck Lever
Because internal memory registration is an expensive and synchronous operation, xprtrdma pre-registers send and receive buffers at mount time, and then re-uses them for each RPC. A "hardway" allocation is a memory allocation and registration that replaces a send buffer during the processing of an RPC. Hardway must be done if the RPC send buffer is too small to accommodate an RPC's call and reply headers. For xprtrdma, each RPC send buffer is currently part of struct rpcrdma_req so that xprt_rdma_free(), which is passed nothing but the address of an RPC send buffer, can find its matching struct rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof. That means that hardway currently has to replace a whole rpcrmda_req when it replaces an RPC send buffer. This is often a fairly hefty chunk of contiguous memory due to the size of the rl_segments array and the fact that both the send and receive buffers are part of struct rpcrdma_req. Some obscure re-use of fields in rpcrdma_req is done so that xprt_rdma_free() can detect replaced rpcrdma_req structs, and restore the original. This commit breaks apart the RPC send buffer and struct rpcrdma_req so that increasing the size of the rl_segments array does not change the alignment of each RPC send buffer. (Increasing rl_segments is needed to bump up the maximum r/wsize for NFS/RDMA). This change opens up some interesting possibilities for improving the design of xprt_rdma_allocate(). xprt_rdma_allocate() is now the one place where RPC send buffers are allocated or re-allocated, and they are now always left in place by xprt_rdma_free(). A large re-allocation that includes both the rl_segments array and the RPC send buffer is no longer needed. Send buffer re-allocation becomes quite rare. Good send buffer alignment is guaranteed no matter what the size of the rl_segments array is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Add struct rpcrdma_regbuf and helpersChuck Lever
There are several spots that allocate a buffer via kmalloc (usually contiguously with another data structure) and then register that buffer internally. I'd like to split the buffers out of these data structures to allow the data structures to scale. Start by adding functions that can kmalloc and register a buffer, and can manage/preserve the buffer's associated ib_sge and ib_mr fields. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()Chuck Lever
Move the details of how to create and destroy rpcrdma_req and rpcrdma_rep structures into helper functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Simplify synopsis of rpcrdma_buffer_create()Chuck Lever
Clean up: There is one call site for rpcrdma_buffer_create(). All of the arguments there are fields of an rpcrdma_xprt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stackChuck Lever
Reduce stack footprint of the connection upcall handler function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Take struct ib_device_attr off the stackChuck Lever
Device attributes are large, and are used in more than one place. Stash a copy in dynamically allocated memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Free the pd if ib_query_qp() failsChuck Lever
If ib_query_qp() fails or the memory registration mode isn't supported, don't leak the PD. An orphaned IB/core resource will cause IB module removal to hang. Fixes: bd7ed1d13304 ("RPC/RDMA: check selected memory registration ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprtChuck Lever
Clean up: The rep_func field always refers to rpcrdma_conn_func(). rep_func should have been removed by commit b45ccfd25d50 ("xprtrdma: Remove MEMWINDOWS registration modes"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Move credit update to RPC reply handlerChuck Lever
Reduce work in the receive CQ handler, which can be run at hardware interrupt level, by moving the RPC/RDMA credit update logic to the RPC reply handler. This has some additional benefits: More header sanity checking is done before trusting the incoming credit value, and the receive CQ handler no longer touches the RPC/RDMA header (the CPU stalls while waiting for the header contents to be brought into the cache). This further extends work begun by commit e7ce710a8802 ("xprtrdma: Avoid deadlock when credit window is reset"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Remove rl_mr field, and the mr_chunk unionChuck Lever
Clean up: Since commit 0ac531c18323 ("xprtrdma: Remove REGISTER memory registration mode"), the rl_mr pointer is no longer used anywhere. After removal, there's only a single member of the mr_chunk union, so mr_chunk can be removed as well, in favor of a single pointer field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Remove rpcrdma_ep::rep_iaChuck Lever
Clean up: This field is not used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprtChuck Lever
Clean up: Use consistent field names in struct rpcrdma_xprt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Clean up hdrlenChuck Lever
Clean up: Replace naked integers with a documenting macro. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Display XIDs in host byte orderChuck Lever
xprtsock.c and the backchannel code display XIDs in host byte order. Follow suit in xprtrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: Modernize htonl and ntohlChuck Lever
Clean up: Replace htonl and ntohl with the be32 equivalents. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30xprtrdma: human-readable completion statusChuck Lever
Make it easier to grep the system log for specific error conditions. The wc.opcode field is not included because opcode numbers are sparse, and because wc.opcode is not necessarily valid when completion reports an error. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-24NFSv4: Deal with atomic upgrades of an existing delegationTrond Myklebust
Ensure that we deal correctly with the case where the server sends us a newer instance of the same delegation. If the stateids match, but the sequence numbers differ, then treat the new delegation as if it were an atomic upgrade. Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
2015-01-24NFSv4.1: Replace usage of nfs_client->cl_addr in encode_create_sessionTrond Myklebust
Replace the current code with something that is a little closer to what net/sunrpc/auth_unix.c uses. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24SUNRPC: Allow waiting on memory allocationTrond Myklebust
We should be safe now, as long as we don't do GFP_IO or higher allocations Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24SUNRPC: Adjust rpciod workqueue parametersTrond Myklebust
Increase the concurrency level for rpciod threads to allow for allocations etc that happen in the RPCSEC_GSS layer. Also note that the NFSv4 byte range locks may now need to allocate memory from inside rpciod. Add the WQ_HIGHPRI flag to improve latency guarantees while we're at it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4.1: Optimise layout return-on-closeTrond Myklebust
Optimise the layout return on close code by ensuring that 1) Add a check for whether we hold a layout before taking any spinlocks 2) Only take the spin lock once 3) Use nfs_state->state to speed up open file checks Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4.1: Allow parallel LOCK/LOCKU callsTrond Myklebust
Note, however, that we still serialise on the open stateid if the lock stateid is unconfirmed. Hopefully that will not prove too much of a burden for first time locks; it should leave the ability to parallelise OPENs unchanged, since they no longer call the serialisation primitives. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4: Update of VFS byte range lock must be atomic with the stateid updateTrond Myklebust
Ensure that we test the lock stateid remained unchanged while we were updating the VFS tracking of the byte range lock. Have the process replay the lock to the server if we detect that was not the case. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4: Fix lock on-wire reordering issuesTrond Myklebust
This patch ensures that the server cannot reorder our LOCK/LOCKU requests if they are sent in parallel on the wire. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4: Always do open_to_lock_owner if the lock stateid is uninitialisedTrond Myklebust
The original text in RFC3530 was terribly confusing since it conflated lockowners and lock stateids. RFC3530bis clarifies that you must use open_to_lock_owner when there is no lock state for that file+lockowner combination. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24NFSv4: Fix atomicity problems with lock stateid updatesTrond Myklebust
When we update the lock stateid, we really do need to ensure that this is done under the state->state_lock, and that we are indeed only updating confirmed locks with a newer version of the same stateid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23NFSv4.1: Allow parallel OPEN/OPEN_DOWNGRADE/CLOSETrond Myklebust
Remove the serialisation of OPEN/OPEN_DOWNGRADE and CLOSE calls for the case of NFSv4.1 and newer. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23NFSv4: Check for NULL argument in nfs_*_seqid() functionsTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23NFSv4: Convert nfs_alloc_seqid() to return an ERR_PTR() if allocation failsTrond Myklebust
When we relax the sequencing on the NFSv4.1 OPEN/CLOSE code, we will want to use the value NULL to indicate that no sequencing is needed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23NFSv4: More CLOSE/OPEN racesTrond Myklebust
If an OPEN RPC call races with a CLOSE or OPEN_DOWNGRADE so that it updates the nfs_state structure before the CLOSE/OPEN_DOWNGRADE has a chance to do so, then we know that the state->flags need to be recalculated from scratch. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23NFSv4: Fix an atomicity problem in CLOSETrond Myklebust
If we are to remove the serialisation of OPEN/CLOSE, then we need to ensure that the stateid sent as part of a CLOSE operation does not change after we test the state in nfs4_close_prepare. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21NFS: Fix use of nfs_attr_use_mounted_on_fileid()Anna Schumaker
This function call was being optimized out during nfs_fhget(), leading to situations where we have a valid fileid but still want to use the mounted_on_fileid. For example, imagine we have our server configured like this: server % df Filesystem Size Used Avail Use% Mounted on /dev/vda1 9.1G 6.5G 1.9G 78% / /dev/vdb1 487M 2.3M 456M 1% /exports /dev/vdc1 487M 2.3M 456M 1% /exports/vol1 /dev/vdd1 487M 2.3M 456M 1% /exports/vol2 If our client mounts /exports and tries to do a "chown -R" across the entire mountpoint, we will get a nasty message warning us about a circular directory structure. Running chown with strace tells me that each directory has the same device and inode number: newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0 newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0 newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0 With this patch the mounted_on_fileid values are used for st_ino, so the directory loop warning isn't reported. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>