summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2010-05-19xfs: fix min bufsize bugs in two placesAlex Elder
This fixes a bug in two places that I found by inspection. In xlog_find_verify_cycle() and xlog_write_log_records(), the code attempts to allocate a buffer to hold as many blocks as possible. It gives up if the number of blocks to be allocated gets too small. Right now it uses log->l_sectbb_log as that lower bound, but I'm sure it's supposed to be the actual log sector size instead. That is, the lower bound should be (1 << log->l_sectbb_log). Also define a simple macro xlog_sectbb(log) to represent the number of basic blocks in a sector for the given log. (No change from original submission; I have implemented Christoph's suggestion about storing l_sectsize rather than l_sectbb_log in a new, separate patch in this series.) Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: add const qualifiers to xfs error function argsAlex Elder
Change the tag and file name arguments to xfs_error_report() and xfs_corruption_error() to use a const qualifier. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: remove xfs_dqmarkerChristoph Hellwig
The xfs_dqmarker structure does not need to exist anymore. Move the remaining flags field out of it and remove the structure altogether. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19xfs: convert the dquot free list to use list headsDave Chinner
Convert the dquot free list on the filesystem to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: convert the dquot hash list to use list headsDave Chinner
Convert the dquot hash list on the filesystem to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: remove duplicate code from dquot reclaimDave Chinner
The dquot shaker and the free-list reclaim code use exactly the same algorithm but the code is duplicated and slightly different in each case. Make the shaker code use the single dquot reclaim code to remove the code duplication. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: convert the per-mount dquot list to use list headsDave Chinner
Convert the dquot list on the filesytesm to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: add log item recovery tracingDave Chinner
Currently there is no tracing in log recovery, so it is difficult to determine what is going on when something goes wrong. Add tracing for log item recovery to provide visibility into the log recovery process. The tracing added shows regions being extracted from the log transactions and added to the transaction hash forming recovery items, followed by the reordering, cancelling and finally recovery of the items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: clean up xlog_write_adv_cntChristoph Hellwig
Replace the awkward xlog_write_adv_cnt with an inline helper that makes it more obvious that it's modifying it's paramters, and replace the use of an integer type for "ptr" with a real void pointer. Also move xlog_write_adv_cnt to xfs_log_priv.h as it will be used outside of xfs_log.c in the delayed logging series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: introduce new internal log vector structureDave Chinner
The current log IO vector structure is a flat array and not extensible. To make it possible to keep separate log IO vectors for individual log items, we need a method of chaining log IO vectors together. Introduce a new log vector type that can be used to wrap the existing log IO vectors on use that internally to the log. This means that the existing external interface (xfs_log_write) does not change and hence no changes to the transaction commit code are required. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: reindent xlog_writeChristoph Hellwig
Reindent xlog_write to normal one tab indents and move all variable declarations into the closest enclosing block. Split from a bigger patch by Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: factor xlog_writeDave Chinner
xlog_write is a mess that takes a lot of effort to understand. It is a mass of nested loops with 4 space indents to get it to fit in 80 columns and lots of funky variables that aren't obvious what they mean or do. Break it down into understandable chunks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19xfs: log ticket reservation underestimates the number of iclogsDave Chinner
When allocation a ticket for a transaction, the ticket is initialised with the worst case log space usage based on the number of bytes the transaction may consume. Part of this calculation is the number of log headers required for the iclog space used up by the transaction. This calculation makes an undocumented assumption that if the transaction uses the log header space reservation on an iclog, then it consumes either the entire iclog or it completes. That is - the transaction that is first in an iclog is the transaction that the log header reservation is accounted to. If the transaction is larger than the iclog, then it will use the entire iclog itself. Document this assumption. Further, the current calculation uses the rule that we can fit iclog_size bytes of transaction data into an iclog. This is in correct - the amount of space available in an iclog for transaction data is the size of the iclog minus the space used for log record headers. This means that the calculation is out by 512 bytes per 32k of log space the transaction can consume. This is rarely an issue because maximally sized transactions are extremely uncommon, and for 4k block size filesystems maximal transaction reservations are about 400kb. Hence the error in this case is less than the size of an iclog, so that makes it even harder to hit. However, anyone using larger directory blocks (16k directory blocks push the maximum transaction size to approx. 900k on a 4k block size filesystem) or larger block size (e.g. 64k blocks push transactions to the 3-4MB size) could see the error grow to more than an iclog and at this point the transaction is guaranteed to get a reservation underrun and shutdown the filesystem. Fix this by adjusting the calculation to calculate the correct number of iclogs required and account for them all up front. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: Clean up xfs_trans_committed code after factoringDave Chinner
Now that the code has been factored, clean up all the remaining style cruft, simplify the code and re-order functions so that it doesn't need forward declarations. Also move the remaining functions that require forward declarations (xfs_trans_uncommit, xfs_trans_free) so that all the forward declarations can be removed from the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: update and factor xfs_trans_committed()Dave Chinner
The function header to xfs-trans_committed has long had this comment: * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item() To prepare for different methods of committing items, convert the code to use xfs_trans_next_item() and factor the code into smaller, more digestible chunks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: clean up xfs_trans_commit logic even moreChristoph Hellwig
> +shut_us_down: > + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0; > + if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) { > + xfs_trans_unreserve_and_mod_sb(tp); > + /* This whole area in _xfs_trans_commit is still a complete mess. So while touching this code, unravel this mess as well to make the whole flow of the function simpler and clearer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19xfs: split out iclog writing from xfs_trans_commit()Dave Chinner
Split the the part of xfs_trans_commit() that deals with writing the transaction into the iclog into a separate function. This isolates the physical commit process from the logical commit operation and makes it easier to insert different transaction commit paths without affecting the existing algorithm adversely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: fix reservation release commit flag in xfs_bmap_add_attrfork()Dave Chinner
xfs_bmap_add_attrfork() passes XFS_TRANS_PERM_LOG_RES to xfs_trans_commit() to indicate that the commit should release the permanent log reservation as part of the commit. This is wrong - the correct flag is XFS_TRANS_RELEASE_LOG_RES - and it is only by the chance that both these flags have the value of 0x4 that the code is doing the right thing. Fix it by changing to use the correct flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: remove stale parameter from ->iop_unpin methodDave Chinner
The staleness of a object being unpinned can be directly derived from the object itself - there is no need to extract it from the object then pass it as a parameter into IOP_UNPIN(). This means we can kill the XFS_LID_BUF_STALE flag - it is set, checked and cleared in the same places XFS_BLI_STALE flag in the xfs_buf_log_item so it is now redundant and hence safe to remove. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: Add inode pin counts to tracesDave Chinner
We don't record pin counts in inode events right now, and this makes it difficult to track down problems related to pinning inodes. Add the pin count to the inode trace class and add trace events for pinning and unpinning inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: factor log item initialisationDave Chinner
Each log item type does manual initialisation of the log item. Delayed logging introduces new fields that need initialisation, so factor all the open coded initialisation into a common function first. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: add blockdev name to kthreadsJan Engelhardt
This allows to see in `ps` and similar tools which kthreads are allotted to which block device/filesystem, similar to what jbd2 does. As the process name is a fixed 16-char array, no extra space is needed in tasks. PID TTY STAT TIME COMMAND 2 ? S 0:00 [kthreadd] 197 ? S 0:00 \_ [jbd2/sda2-8] 198 ? S 0:00 \_ [ext4-dio-unwrit] 204 ? S 0:00 \_ [flush-8:0] 2647 ? S 0:00 \_ [xfs_mru_cache] 2648 ? S 0:00 \_ [xfslogd/0] 2649 ? S 0:00 \_ [xfsdatad/0] 2650 ? S 0:00 \_ [xfsconvertd/0] 2651 ? S 0:00 \_ [xfsbufd/ram0] 2652 ? S 0:00 \_ [xfsaild/ram0] 2653 ? S 0:00 \_ [xfssyncd/ram0] Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19xfs: Fix integer overflow in fs/xfs/linux-2.6/xfs_ioctl*.cZhitong Wang
The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface is not bounded correctly. The opcount is used to determine the size of the buffer required. The size is bounded, but can overflow and so the size checks may not be sufficient to catch invalid opcounts. Fix it by catching opcount values that would cause overflows before calculating the size. Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: check for read permission on src file in the clone ioctl
2010-05-15Btrfs: check for read permission on src file in the clone ioctlDan Rosenberg
The existing code would have allowed you to clone a file that was only open for writing Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: JFS: Free sbi memory in error path fs/sysv: dereferencing ERR_PTR() Fix double-free in logfs Fix the regression created by "set S_DEAD on unlink()..." commit
2010-05-15JFS: Free sbi memory in error pathJan Blunck
I spotted the missing kfree() while removing the BKL. [akpm@linux-foundation.org: avoid multiple returns so it doesn't happen again] Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-15fs/sysv: dereferencing ERR_PTR()Dan Carpenter
I moved the dir_put_page() inside the if condition so we don't dereference "page", if it's an ERR_PTR(). Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-15Fix double-free in logfsAl Viro
iput() is needed *until* we'd done successful d_alloc_root() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-15Fix the regression created by "set S_DEAD on unlink()..." commitAl Viro
1) i_flags simply doesn't work for mount/unlink race prevention; we may have many links to file and rm on one of those obviously shouldn't prevent bind on top of another later on. To fix it right way we need to mark _dentry_ as unsuitable for mounting upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and i_mutex on the inode in question. Set it (with dont_mount(dentry)) in unlink/rmdir/etc., check (with cant_mount(dentry)) in places in namespace.c that used to check for S_DEAD. Setting S_DEAD is still needed in places where we used to set it (for directories getting killed), since we rely on it for readdir/rmdir race prevention. 2) rename()/mount() protection has another bogosity - we unhash the target before we'd checked that it's not a mountpoint. Fixed. 3) ancient bogosity in pivot_root() - we locked i_mutex on the right directory, but checked S_DEAD on the different (and wrong) one. Noticed and fixed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-14Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notifyLinus Torvalds
* 'for-linus' of git://git.infradead.org/users/eparis/notify: inotify: don't leak user struct on inotify release inotify: race use after free/double free in inotify inode marks inotify: clean up the inotify_add_watch out path Inotify: undefined reference to `anon_inode_getfd' Manual merge to remove duplicate "select ANON_INODES" from Kconfig file
2010-05-14inotify: don't leak user struct on inotify releasePavel Emelyanov
inotify_new_group() receives a get_uid-ed user_struct and saves the reference on group->inotify_data.user. The problem is that free_uid() is never called on it. Issue seem to be introduced by 63c882a0 (inotify: reimplement inotify using fsnotify) after 2.6.30. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Eric Paris <eparis@parisplace.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Paris <eparis@redhat.com>
2010-05-14inotify: race use after free/double free in inotify inode marksEric Paris
There is a race in the inotify add/rm watch code. A task can find and remove a mark which doesn't have all of it's references. This can result in a use after free/double free situation. Task A Task B ------------ ----------- inotify_new_watch() allocate a mark (refcnt == 1) add it to the idr inotify_rm_watch() inotify_remove_from_idr() fsnotify_put_mark() refcnt hits 0, free take reference because we are on idr [at this point it is a use after free] [time goes on] refcnt may hit 0 again, double free The fix is to take the reference BEFORE the object can be found in the idr. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: <stable@kernel.org>
2010-05-14inotify: clean up the inotify_add_watch out pathEric Paris
inotify_add_watch explictly frees the unused inode mark, but it can just use the generic code. Just do that. Signed-off-by: Eric Paris <eparis@redhat.com>
2010-05-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: guard against hardlinking directories
2010-05-13vfs: Fix O_NOFOLLOW behavior for paths with trailing slashesJan Kara
According to specification mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY) should return success but currently it returns ELOOP. This is a regression caused by path lookup cleanup patch series. Fix the code to ignore O_NOFOLLOW in case the provided path has trailing slashes. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de> Acked-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: preserve seq # on requeued messages after transient transport errors ceph: fix cap removal races ceph: zero unused message header, footer fields ceph: fix locking for waking session requests after reconnect ceph: resubmit requests on pg mapping change (not just primary change) ceph: fix open file counting on snapped inodes when mds returns no caps ceph: unregister osd request on failure ceph: don't use writeback_control in writepages completion ceph: unregister bdi before kill_anon_super releases device name
2010-05-12CacheFiles: Fix error handling in cachefiles_determine_cache_security()David Howells
cachefiles_determine_cache_security() is expected to return with a security override in place. However, if set_create_files_as() fails, we fail to do this. In this case, we should just reinstate the security override that was set by the caller. Furthermore, if set_create_files_as() fails, we should dispose of the new credentials we were in the process of creating. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-12Inotify: undefined reference to `anon_inode_getfd'Russell King
Fix: fs/built-in.o: In function `sys_inotify_init1': summary.c:(.text+0x347a4): undefined reference to `anon_inode_getfd' found by kautobuild with arms bcmring_defconfig, which ends up with INOTIFY_USER enabled (through the 'default y') but leaves ANON_INODES unset. However, inotify_user.c uses anon_inode_getfd(). Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Eric Paris <eparis@redhat.com>
2010-05-11ceph: preserve seq # on requeued messages after transient transport errorsSage Weil
If the tcp connection drops and we reconnect to reestablish a stateful session (with the mds), we need to resend previously sent (and possibly received) messages with the _same_ seq # so that they can be dropped on the other end if needed. Only assign a new seq once after the message is queued. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11ceph: fix cap removal racesSage Weil
The iterate_session_caps helper traverses the session caps list and tries to grab an inode reference. However, the __ceph_remove_cap was clearing the inode backpointer _before_ removing itself from the session list, causing a null pointer dereference. Clear cap->ci under protection of s_cap_lock to avoid the race, and to tightly couple the list and backpointer state. Use a local flag to indicate whether we are releasing the cap, as cap->session may be modified by a racing thread in iterate_session_caps. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11revert "procfs: provide stack information for threads" and its fixup commitsRobin Holt
Originally, commit d899bf7b ("procfs: provide stack information for threads") attempted to introduce a new feature for showing where the threadstack was located and how many pages are being utilized by the stack. Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was applied to fix the NO_MMU case. Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") was applied to fix a bug in ia32 executables being loaded. Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel threads") was applied to fix a bug which had kernel threads printing a userland stack address. Commit 1306d603f ('proc: partially revert "procfs: provide stack information for threads"') was then applied to revert the stack pages being used to solve a significant performance regression. This patch nearly undoes the effect of all these patches. The reason for reverting these is it provides an unusable value in field 28. For x86_64, a fork will result in the task->stack_start value being updated to the current user top of stack and not the stack start address. This unpredictability of the stack_start value makes it worthless. That includes the intended use of showing how much stack space a thread has. Other architectures will get different values. As an example, ia64 gets 0. The do_fork() and copy_process() functions appear to treat the stack_start and stack_size parameters as architecture specific. I only partially reverted c44972f1 ("procfs: disable per-task stack usage on NOMMU") . If I had completely reverted it, I would have had to change mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the builds without significant effort, I decided to not change mm/Makefile. I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") . I left the KSTK_ESP() change in place as that seemed worthwhile. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11ceph: zero unused message header, footer fieldsSage Weil
We shouldn't leak any prior memory contents to other parties. And random data, particularly in the 'version' field, can cause problems down the line. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11cifs: guard against hardlinking directoriesJeff Layton
When we made serverino the default, we trusted that the field sent by the server in the "uniqueid" field was actually unique. It turns out that it isn't reliably so. Samba, in particular, will just put the st_ino in the uniqueid field when unix extensions are enabled. When a share spans multiple filesystems, it's quite possible that there will be collisions. This is a server bug, but when the inodes in question are a directory (as is often the case) and there is a collision with the root inode of the mount, the result is a kernel panic on umount. Fix this by checking explicitly for directory inodes with the same uniqueid. If that is the case, then we can assume that using server inode numbers will be a problem and that they should be disabled. Fixes Samba bugzilla 7407 Signed-off-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@kernel.org> Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-05-11CacheFiles: Fix occasional EIO on call to vfs_unlink()David Howells
Fix an occasional EIO returned by a call to vfs_unlink(): [ 4868.465413] CacheFiles: I/O Error: Unlink failed [ 4868.465444] FS-Cache: Cache cachefiles stopped due to I/O error [ 4947.320011] CacheFiles: File cache on md3 unregistering [ 4947.320041] FS-Cache: Withdrawing cache "mycache" [ 5127.348683] FS-Cache: Cache "mycache" added (type cachefiles) [ 5127.348716] CacheFiles: File cache on md3 registered [ 7076.871081] CacheFiles: I/O Error: Unlink failed [ 7076.871130] FS-Cache: Cache cachefiles stopped due to I/O error [ 7116.780891] CacheFiles: File cache on md3 unregistering [ 7116.780937] FS-Cache: Withdrawing cache "mycache" [ 7296.813394] FS-Cache: Cache "mycache" added (type cachefiles) [ 7296.813432] CacheFiles: File cache on md3 registered What happens is this: (1) A cached NFS file is seen to have become out of date, so NFS retires the object and immediately acquires a new object with the same key. (2) Retirement of the old object is done asynchronously - so the lookup/create to generate the new object may be done first. This can be a problem as the old object and the new object must exist at the same point in the backing filesystem (i.e. they must have the same pathname). (3) The lookup for the new object sees that a backing file already exists, checks to see whether it is valid and sees that it isn't. It then deletes that file and creates a new one on disk. (4) The retirement phase for the old file is then performed. It tries to delete the dentry it has, but ext4_unlink() returns -EIO because the inode attached to that dentry no longer matches the inode number associated with the filename in the parent directory. The trace below shows this quite well. [md5sum] ==> __fscache_relinquish_cookie(ffff88002d12fb58{NFS.fh,ffff88002ce62100},1) [md5sum] ==> __fscache_acquire_cookie({NFS.server},{NFS.fh},ffff88002ce62100) NFS has retired the old cookie and asked for a new one. [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_ACTIVE,24}) [kslowd] <== fscache_object_state_machine() [->OBJECT_DYING] [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_INIT,0}) [kslowd] <== fscache_object_state_machine() [->OBJECT_LOOKING_UP] [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_DYING,24}) [kslowd] <== fscache_object_state_machine() [->OBJECT_RECYCLING] The old object (OBJ52) is going through the terminal states to get rid of it, whilst the new object - (OBJ53) - is coming into being. [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_LOOKING_UP,0}) [kslowd] ==> cachefiles_walk_to_object({ffff88003029d8b8},OBJ53,@68,) [kslowd] lookup '@68' [kslowd] next -> ffff88002ce41bd0 positive [kslowd] advance [kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA' [kslowd] next -> ffff8800369faac8 positive The new object has looked up the subdir in which the file would be in (getting dentry ffff88002ce41bd0) and then looked up the file itself (getting dentry ffff8800369faac8). [kslowd] validate 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA' [kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA') [kslowd] remove ffff8800369faac8 from ffff88002ce41bd0 [kslowd] unlink stale object [kslowd] <== cachefiles_bury_object() = 0 It then checks the file's xattrs to see if it's valid. NFS says that the auxiliary data indicate the file is out of date (obvious to us - that's why NFS ditched the old version and got a new one). CacheFiles then deletes the old file (dentry ffff8800369faac8). [kslowd] redo lookup [kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA' [kslowd] next -> ffff88002cd94288 negative [kslowd] create -> ffff88002cd94288{ffff88002cdaf238{ino=148247}} CacheFiles then redoes the lookup and gets a negative result in a new dentry (ffff88002cd94288) which it then creates a file for. [kslowd] ==> cachefiles_mark_object_active(,OBJ53) [kslowd] <== cachefiles_mark_object_active() = 0 [kslowd] === OBTAINED_OBJECT === [kslowd] <== cachefiles_walk_to_object() = 0 [148247] [kslowd] <== fscache_object_state_machine() [->OBJECT_AVAILABLE] The new object is then marked active and the state machine moves to the available state - at which point NFS can start filling the object. [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_RECYCLING,20}) [kslowd] ==> fscache_release_object() [kslowd] ==> cachefiles_drop_object({OBJ52,2}) [kslowd] ==> cachefiles_delete_object(,OBJ52{ffff8800369faac8}) The old object, meanwhile, goes on with being retired. If allocation occurs first, cachefiles_delete_object() has to wait for dir->d_inode->i_mutex to become available before it can continue. [kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA') [kslowd] remove ffff8800369faac8 from ffff88002ce41bd0 [kslowd] unlink stale object EXT4-fs warning (device sda6): ext4_unlink: Inode number mismatch in unlink (148247!=148193) CacheFiles: I/O Error: Unlink failed FS-Cache: Cache cachefiles stopped due to I/O error CacheFiles then tries to delete the file for the old object, but the dentry it has (ffff8800369faac8) no longer points to a valid inode for that directory entry, and so ext4_unlink() returns -EIO when de->inode does not match i_ino. [kslowd] <== cachefiles_bury_object() = -5 [kslowd] <== cachefiles_delete_object() = -5 [kslowd] <== fscache_object_state_machine() [->OBJECT_DEAD] [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_AVAILABLE,0}) [kslowd] <== fscache_object_state_machine() [->OBJECT_ACTIVE] (Note that the above trace includes extra information beyond that produced by the upstream code). The fix is to note when an object that is being retired has had its object deleted preemptively by a replacement object that is being created, and to skip the second removal attempt in such a case. Reported-by: Greg M <gregm@servu.net.au> Reported-by: Mark Moseley <moseleymark@gmail.com> Reported-by: Romain DEGEZ <romain.degez@smartjog.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11ceph: fix locking for waking session requests after reconnectSage Weil
The session->s_waiting list is protected by mdsc->mutex, not s_mutex. This was causing (rare) s_waiting list corruption. Fix errors paths too, while we're here. A more thorough cleanup of this function is coming soon. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11ceph: resubmit requests on pg mapping change (not just primary change)Sage Weil
OSD requests need to be resubmitted on any pg mapping change, not just when the pg primary changes. Resending only when the primary changes results in occasional 'hung' requests during osd cluster recovery or rebalancing. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11ceph: fix open file counting on snapped inodes when mds returns no capsSage Weil
It's possible the MDS will not issue caps on a snapped inode, in which case an open request may not __ceph_get_fmode(), botching the open file counting. (This is actually a server bug, but the client shouldn't BUG out in this case.) Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11ceph: unregister osd request on failureSage Weil
The osd request wasn't being unregistered when the osd returned a failure code, even though the result was returned to the caller. This would cause it to eventually time out, and then crash the kernel when it tried to resend the request using a stale page vector. Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-10autofs4-2.6.34-rc1 - fix link_count usageIan Kent
After commit 1f36f774b2 ("Switch !O_CREAT case to use of do_last()") in 2.6.34-rc1 autofs direct mounts stopped working. This is caused by current->link_count being 0 when ->follow_link() is called from do_filp_open(). I can't work out why this hasn't been seen before Als patch series. This patch removes the autofs dependence on current->link_count. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>