summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2014-09-29xfs: annotate user variables passed as voidDave Chinner
Some argument callbacks can contain user buffers, and sparse warns about passing them as void pointers. Cast appropriately to remove the sparse warnings. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: xfs_kset should be static Dave Chinner
As it is accessed through the struct xfs_mount and can be set up entirely from fs/xfs/xfs_super.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: xfs_qm_dquot_isolate needs locking annotations for sparseDave Chinner
To remove noise from the build. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: fix use of agi_newino in finobt lookupDave Chinner
Sparse warns that we are passing the big-endian valueo f agi_newino to the initial btree lookup function when trying to find a new inode. This is wrong - we need to pass the host order value, not the disk order value. This will adversely affect the next inode allocated, but given that the free inode btree is usually much smaller than the allocated inode btree it is much less likely to be a performance issue if we start the search in the wrong place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29Merge branch 'xfs-trans-recover-cleanup' into for-nextDave Chinner
2014-09-29xfs: refactor recovery transaction start handlingDave Chinner
Rework the transaction lookup and allocation code in xlog_recovery_process_ophdr() to fold two related call-once helper functions into a single helper. Then fold in all the XLOG_START_TRANS logic to that helper to clean up the remaining logic in xlog_recovery_process_ophdr(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: reorganise transaction recovery item codeDave Chinner
The code for managing transactions anf the items for recovery is spread across 3 different locations in the file. Move them all together so that it is easy to read the code without needing to jump long distances in the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: fix double free in xlog_recover_commit_transDave Chinner
When an error occurs during buffer submission in xlog_recover_commit_trans(), we free the trans structure twice. Fix it by only freeing the structure in the caller regardless of the success or failure of the function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: recovery of XLOG_UNMOUNT_TRANS leaks memoryDave Chinner
The XLOG_UNMOUNT_TRANS case skips the transaction, despite the fact an unmount record is always in a standalone transaction. Hence whenever we come across one of these we need to free the transaction structure associated with it as there is no commit record that follows it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29xfs: refactor xlog_recover_process_data()Dave Chinner
Clean up xlog_recover_process_data() structure in preparation for fixing the allocation and freeing context of the transaction being recovered. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-28NFSv4: fix open/lock state recovery error handlingTrond Myklebust
The current open/lock state recovery unfortunately does not handle errors such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping, just proceeds as if the state manager is finished recovering. This patch ensures that we loop back, handle higher priority errors and complete the open/lock state recovery. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM failsTrond Myklebust
If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted a second time, then the client will currently take this to mean that it must declare all locks to be stale, and hence ineligible for reboot recovery. RFC3530 and RFC5661 both suggest that the client should instead rely on the server to respond to inelegible open share, lock and delegation reclaim requests with NFS4ERR_NO_GRACE in this situation. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes + unifying __d_move() and __d_materialise_dentry() + minimal regression fix for d_path() of victims of overwriting rename() ported on top of that" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Don't exchange "short" filenames unconditionally. fold swapping ->d_name.hash into switch_names() fold unlocking the children into dentry_unlock_parents_for_move() kill __d_materialise_dentry() __d_materialise_dentry(): flip the order of arguments __d_move(): fold manipulations with ->d_child/->d_subdirs don't open-code d_rehash() in d_materialise_unique() pull rehashing and unlocking the target dentry into __d_materialise_dentry() ufs: deal with nfsd/iget races fuse: honour max_read and max_write in direct_io mode shmem: fix nlink for rename overwrite directory
2014-09-27vfs: Don't exchange "short" filenames unconditionally.Mikhail Efremov
Only exchange source and destination filenames if flags contain RENAME_EXCHANGE. In case if executable file was running and replaced by other file /proc/PID/exe should still show correct file name, not the old name of the file by which it was replaced. The scenario when this bug manifests itself was like this: * ALT Linux uses rpm and start-stop-daemon; * during a package upgrade rpm creates a temporary file for an executable to rename it upon successful unpacking; * start-stop-daemon is run subsequently and it obtains the (nonexistant) temporary filename via /proc/PID/exe thus failing to identify the running process. Note that "long" filenames (> DNAiME_INLINE_LEN) are still exchanged without RENAME_EXCHANGE and this behaviour exists long enough (should be fixed too apparently). So this patch is just an interim workaround that restores behavior for "short" names as it was before changes introduced by commit da1ce0670c14 ("vfs: add cross-rename"). See https://lkml.org/lkml/2014/9/7/6 for details. AV: the comments about being more careful with ->d_name.hash than with ->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s, when we started to unhash the target instead of swapping hash chain positions followed by d_delete() as we used to do when dcache was first introduced. Acked-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: da1ce0670c14 "vfs: add cross-rename" Signed-off-by: Mikhail Efremov <sem@altlinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27fold swapping ->d_name.hash into switch_names()Linus Torvalds
and do it along with ->d_name.len there Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26fold unlocking the children into dentry_unlock_parents_for_move()Al Viro
... renaming it into dentry_unlock_for_move() and making it more symmetric with dentry_lock_for_move(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26kill __d_materialise_dentry()Al Viro
it folds into __d_move() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26__d_materialise_dentry(): flip the order of argumentsAl Viro
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry) case in __d_move(). A bit more and it'll fold in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26__d_move(): fold manipulations with ->d_child/->d_subdirsAl Viro
list_del() + list_add() is a slightly pessimised list_move() list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init() Interleaving those makes the resulting code even worse. And harder to follow... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26don't open-code d_rehash() in d_materialise_unique()Al Viro
... and get rid of duplicate BUG_ON() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26pull rehashing and unlocking the target dentry into __d_materialise_dentry()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26ufs: deal with nfsd/iget racesAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26fuse: honour max_read and max_write in direct_io modeMiklos Szeredi
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of bytes a caller asked to pack into fuse request. This value may be lesser than capacity of fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Now, when helper iov_iter_get_pages() performs all hard work of extracting pages from iov_iter, it can be done by passing properly calculated "maxsize" to the helper. The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need this capability, so pass LONG_MAX as the maxsize argument here. Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") Reported-by: Werner Baumann <werner.baumann@onlinehome.de> Tested-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26nfsd: introduce nfsd4_callback_opsChristoph Hellwig
Add a higher level abstraction than the rpc_ops for callback operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: split nfsd4_callback initialization and useChristoph Hellwig
Split out initializing the nfs4_callback structure from using it. For the NULL callback this gets rid of tons of pointless re-initializations. Note that I don't quite understand what protects us from running multiple NULL callbacks at the same time, but at least this chance doesn't make it worse.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: introduce a generic nfsd4_cbChristoph Hellwig
Add a helper to queue up a callback. CB_NULL has a bit of special casing because it is special in the specification, but all other new callback operations will be able to share code with this and a few more changes to refactor the callback code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: remove nfsd4_callback.cb_opChristoph Hellwig
We can always get at the private data by using container_of, no need for a void pointer. Also introduce a little to_delegation helper to avoid opencoding the container_of everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: do not clear rpc_resp in nfsd4_cb_done_sequenceBenny Halevy
This is incorrect when a callback is has to be restarted, in which case the XDR decoding of the second iteration will see a NULL cb argument. [hch: updated description] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: fix nfsd4_cb_recall_done error handlingChristoph Hellwig
For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID, nfsd4_cb_recall_done first marks the connection down, then retries until dl_retries hits zero, then marks the connection down again and sets cb_done. This changes the code to only retry for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting cb_done into a single point in the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26fs/cachefiles: add missing \n to kerror conversionsFabian Frederick
Commit 0227d6abb378 ("fs/cachefiles: replace kerror by pr_err") didn't include newline featuring in original kerror definition Signed-off-by: Fabian Frederick <fabf@skynet.be> Reported-by: David Howells <dhowells@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> [3.16.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26mm: softdirty: addresses before VMAs in PTE holes aren't softdirtyPeter Feiner
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This bug was introduced in commit 68b5a6524856 ("mm: softdirty: respect VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs returned by find_vma. Tested: Wrote a selftest that creates a PMD-sized VMA then unmaps the first page and asserts that the page is not softdirty. I'm going to send the pagemap selftest in a later commit. Signed-off-by: Peter Feiner <pfeiner@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Jamie Liu <jamieliu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26ocfs2/dlm: do not get resource spinlock if lockres is newJoseph Qi
There is a deadlock case which reported by Guozhonghua: https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html This case is caused by &res->spinlock and &dlm->master_lock misordering in different threads. It was introduced by commit 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers"). Since lockres is new, it doesn't not require the &res->spinlock. So remove it. Fixes: 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers") Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Reported-by: Guozhonghua <guozhonghua@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26nilfs2: fix data loss with mmap()Andreas Rohner
This bug leads to reproducible silent data loss, despite the use of msync(), sync() and a clean unmount of the file system. It is easily reproducible with the following script: ----------------[BEGIN SCRIPT]-------------------- mkfs.nilfs2 -f /dev/sdb mount /dev/sdb /mnt dd if=/dev/zero bs=1M count=30 of=/mnt/testfile umount /mnt mount /dev/sdb /mnt CHECKSUM_BEFORE="$(md5sum /mnt/testfile)" /root/mmaptest/mmaptest /mnt/testfile 30 10 5 sync CHECKSUM_AFTER="$(md5sum /mnt/testfile)" umount /mnt mount /dev/sdb /mnt CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)" umount /mnt echo "BEFORE MMAP:\t$CHECKSUM_BEFORE" echo "AFTER MMAP:\t$CHECKSUM_AFTER" echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT" ----------------[END SCRIPT]-------------------- The mmaptest tool looks something like this (very simplified, with error checking removed): ----------------[BEGIN mmaptest]-------------------- data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE, MAP_SHARED, fd, file_offset); for (i = 0; i < write_count; ++i) { memcpy(data + i * 4096, buf, sizeof(buf)); msync(data, file_size - file_offset, MS_SYNC)) } ----------------[END mmaptest]-------------------- The output of the script looks something like this: BEFORE MMAP: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile AFTER MMAP: 6604a1c31f10780331a6850371b3a313 /mnt/testfile AFTER REMOUNT: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile So it is clear, that the changes done using mmap() do not survive a remount. This can be reproduced a 100% of the time. The problem was introduced in commit 136e8770cd5d ("nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary"). If the page was read with mpage_readpage() or mpage_readpages() for example, then it has no buffers attached to it. In that case page_has_buffers(page) in nilfs_set_page_dirty() will be false. Therefore nilfs_set_file_dirty() is never called and the pages are never collected and never written to disk. This patch fixes the problem by also calling nilfs_set_file_dirty() if the page has no buffers attached to it. [akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/] Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Tested-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26ocfs2: free vol_label in ocfs2_delete_osb()Joseph Qi
osb->vol_label is malloced in ocfs2_initialize_super but not freed if error occurs or during umount, thus causing a memory leak. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26UBIFS: Align the dump messages of SB_NODEhujianyang
I found the dump messages of UBIFS_SB_NODE is not aligned. This patch remove the extra space from the line which is retracted. Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-25NFS: Fabricate fscache server index key correctlyDavid Howells
When fabricating a server index key for fscache, we should clear the index key buffer before starting to fill it in, not in the middle. Reported-by: James Pearson <james-p@moving-picture.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFSv3: Fix missing includes of nfs3_fs.hTrond Myklebust
Silence a few warnings about missing symbols that are due to missing includes of nfs3_fs.h. Fixes: 00a36a1090350 (NFS: Move v3 declarations out of internal.h) Fixes: cb8c20fa53ec2 (NFS: Move NFS v3 acl functions to nfs3_fs.h) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()NeilBrown
Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid waiting at all in nfs_release_page when congested.NeilBrown
If nfs_release_page() is called on a sequence of pages which are all in the same file which is blocked on COMMIT, each page could contribute a 1 second delay which could be come excessive. I have seen delays of as much as 208 seconds. To keep the delay to one second, mark the bdi as write-congested if the commit didn't finished. Once it does finish, the write-congested flag will be cleared by nfs_commit_release_pages(). With this, the longest total delay in try_to_free_pages that I have seen is under 3 seconds. With no waiting in nfs_release_page at all I have seen delays of nearly 1.5 seconds. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid deadlocks with loop-back mounted NFS filesystems.NeilBrown
Support for loop-back mounted NFS filesystems is useful when NFS is used to access shared storage in a high-availability cluster. If the node running the NFS server fails, some other node can mount the filesystem and start providing NFS service. If that node already had the filesystem NFS mounted, it will now have it loop-back mounted. nfsd can suffer a deadlock when allocating memory and entering direct reclaim. While direct reclaim does not write to the NFS filesystem it can send and wait for a COMMIT through nfs_release_page(). This patch modifies nfs_release_page() to wait a limited time for the commit to complete - one second. If the commit doesn't complete in this time, nfs_release_page() will fail. This means it might now fail in some cases where it wouldn't before. These cases are only when 'gfp' includes '__GFP_WAIT'. nfs_release_page() is only called by try_to_release_page(), and that can only be called on an NFS page with required 'gfp' flags from - page_cache_pipe_buf_steal() in splice.c - shrink_page_list() in vmscan.c - invalidate_inode_pages2_range() in truncate.c The first two handle failure quite safely. The last is only called after ->launder_page() has been called, and that will have waited for the commit to finish already. So aborting if the commit takes longer than 1 second is perfectly safe. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24NFS: don't use STABLE writes during writeback.NeilBrown
commit b31268ac793fd300da66b9c28bbf0a200339ab96 FS: Use stable writes when not doing a bulk flush was a bit heavy handed. The particular problem that lead to this patch was that small writes to an O_SYNC file we being written as UNSTABLE writes followed by a commit. This is appropriate for large writes (which require multiple NFS requests) but for small writes (single NFS request), using NFS_FILE_SYNC is more efficient. So that patch causes the code to select between the two methods depending on how many nfs requests get generated. Unfortunately this ends up applying to non O_SYNC writes as well. In particular if you memory-map a file and update random pages, then when they are eventually written out by writeback they will go as NFS_FILE_SYNC. This is inefficient and slows down the application. So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL. With this patch: O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE followed by COMMIT for multiple requests Writing immediately before close of fsync follow the same pattern. Non-O_SYNC writes without an fsync of close eventually get flushed out as UNSTABLE and a commit follows eventually as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.NeilBrown
Currently asynchronous NFSv4 request will be retried with exponential timeout (from 1/10 to 15 seconds), but async requests will always use a 15second retry. Some "async" requests are really synchronous though. The async mechanism is used to allow the request to continue if the requesting process is killed. In those cases, an exponential retry is appropriate. For example, if two different clients both open a file and get a READ delegation, and one client then unlinks the file (while still holding an open file descriptor), that unlink will used the "silly-rename" handling which is async. The first rename will result in NFS4ERR_DELAY while the delegation is reclaimed from the other client. The rename will not be retried for 15 seconds, causing an unlink to take 15 seconds rather than 100msec. This patch only added exponential timeout for async unlink and async rename. Other async calls, such as 'close' are sometimes waited for so they might benefit from exponential timeout too. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24lockd: Try to reconnect if statd has movedBenjamin Coddington
If rpc.statd is restarted, upcalls to monitor hosts can fail with ECONNREFUSED. In that case force a lookup of statd's new port and retry the upcall. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24Fixing lease renewalOlga Kornievskaia
Commit c9fdeb28 removed a 'continue' after checking if the lease needs to be renewed. However, if client hasn't moved, the code falls down to starting reboot recovery erroneously (ie., sends open reclaim and gets back stale_clientid error) before recovering from getting stale_clientid on the renew operation. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c9fdeb280b8c (NFS: Add basic migration support to state manager thread) Cc: stable@vger.kernel.org # 3.13+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24nfs: fix duplicate proc entriesFabian Frederick
Commit 65b38851a174 ("NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes") updated the following function: static int nfs_volume_list_open(struct inode *inode, struct file *file) it used &nfs_server_list_ops instead of &nfs_volume_list_ops which means cat /proc/fs/nfsfs/volumes = /proc/fs/nfsfs/servers Signed-off-by: Fabian Frederick <fabf@skynet.be> Fixes: 65b38851a174 (NFS: Fix /proc/fs/nfsfs/servers and...) Cc: stable@vger.kernel.org # 3.4.x+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24percpu_ref: add PERCPU_REF_INIT_* flagsTejun Heo
With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24Merge branch 'for-linus' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>
2014-09-23f2fs: use more free segments until SSR is activatedJaegeuk Kim
Previously, f2fs activates SSR if the # of free segments reaches to the # of overprovisioned segments. In this case, SSR starts to use dirty segments only, so that the overprovisoned space cannot be selected for new data. This means that we have no chance to utilizae the overprovisioned space at all. This patch fixes that by allowing LFS allocations until the # of free segments reaches to the last threshold, reserved space. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23f2fs: change the ipu_policy option to enable combinationsJaegeuk Kim
This patch changes the ipu_policy setting to use any combination of orthogonal policies. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23f2fs: fix to search whole dirty segmap when get_victimChao Yu
In ->get_victim we get max_search value from dirty_i->nr_dirty without protection of seglist_lock, after that, nr_dirty can be increased/decreased before we hold seglist_lock lock. Then in main loop we attempt to traverse all dirty section one time to find victim section, but it's not accurate to use max_search as the total loop count, because we might lose checking several sections or check sections redundantly for the case of nr_dirty are increased or decreased previously. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>