summaryrefslogtreecommitdiffstats
path: root/fs/xfs
AgeCommit message (Collapse)Author
2014-09-23xfs: track collapse via file offset rather than extent indexBrian Foster
The collapse range implementation uses a transaction per extent shift. The progress of the overall operation is tracked via the current extent index of the in-core extent list. This is racy because the ilock must be dropped and reacquired for each transaction according to locking and log reservation rules. Therefore, writeback to prior regions of the file is possible and can change the extent count. This changes the extent to which the current index refers and causes the collapse to fail mid operation. To avoid this problem, the entire file is currently written back before the collapse operation starts. To eliminate the need to flush the entire file, use the file offset (fsb) to track the progress of the overall extent shift operation rather than the extent index. Modify xfs_bmap_shift_extents() to unconditionally convert the start_fsb parameter to an extent index and return the file offset of the extent where the shift left off, if further extents exist. The bulk of ths function can remain based on extent index as ilock is held by the caller. xfs_collapse_file_space() now uses the fsb output as the starting point for the subsequent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: ensure WB_SYNC_ALL writeback handles partial pages correctlyDave Chinner
XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code has triggered erroneous EBUSY errors on page invalidtion for block size smaller than page size filesystems. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync. The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set. Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broken this. THe act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again. This same writeback bug was found recently in ext4 and fixed in commit 1c8349a ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes. cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e Root-cause-found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-22Merge branch 'for-linus' into for-3.18/coreJens Axboe
Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.
2014-09-09Merge branch 'xfs-misc-fixes-for-3.18-1' into for-nextDave Chinner
2014-09-09xfs: remove rbpp check from xfs_rtmodify_summary_intEric Sandeen
rbpp is always passed into xfs_rtmodify_summary and xfs_rtget_summary, so there is no need to test for it in xfs_rtmodify_summary_int. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: combine xfs_rtmodify_summary and xfs_rtget_summaryEric Sandeen
xfs_rtmodify_summary and xfs_rtget_summary are almost identical; fold them into xfs_rtmodify_summary_int(), with wrappers for each of the original calls. The _int function modifies if a delta is passed, and returns a summary pointer if *sum is passed. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: combine xfs_dir_canenter into xfs_dir_createnameEric Sandeen
xfs_dir_canenter and xfs_dir_createname are almost identical. Fold the former into the latter, with a helpful wrapper for the former. If createname is called without an inode number, it now only checks for space, and does not actually add the entry. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: check resblks before calling xfs_dir_canenterEric Sandeen
Move the resblks test out of the xfs_dir_canenter, and into the caller. This makes a little more sense on the face of it; xfs_dir_canenter immediately returns if resblks !=0; and given some of the comments preceding the calls: * Check for ability to enter directory entry, if no space reserved. even more so. It also facilitates the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: deduplicate xlog_do_recovery_pass()Eric Sandeen
In xlog_do_recovery_pass(), there are 2 distinct cases: non-wrapped and wrapped log recovery. If we find a wrapped log, we recover around the end of the log, and then handle the rest of recovery exactly as in the non-wrapped case - using exactly the same (duplicated) code. Rather than having the same code in both cases, we can get the wrapped portion out of the way first if needed, and then recover the non-wrapped portion of the log. There should be no functional change here, just code reorganization & deduplication. The patch looks a bit bigger than it really is; the last hunk is whitespace changes (un-indenting). Tested with xfstests "check -g log" on a stock configuration. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: lseek: the "whence" argument is called "whence"Eric Sandeen
For some reason, the older commit: 965c8e5 lseek: the "whence" argument is called "whence" lseek: the "whence" argument is called "whence" But the kernel decided to call it "origin" instead. Fix most of the sites. left out xfs. So fix xfs. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: combine xfs_seek_hole & xfs_seek_dataEric Sandeen
xfs_seek_hole & xfs_seek_data are remarkably similar; so much so that they can be combined, saving a fair bit of semi-complex code duplication. The following patch passes generic/285 and generic/286, which specifically test seek behavior. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: export log_recovery_delay to delay mount time log recoveryBrian Foster
XFS log recovery has been discovered to have race conditions with buffers when I/O errors occur. External tools are available to simulate I/O errors to XFS, but this alone is not sufficient for testing log recovery. XFS unconditionally resets the inactive region of the log prior to log recovery to avoid confusion over processing any partially written log records that might have been written before an unclean shutdown. Therefore, unconditional write I/O failures at mount time are caught by the reset sequence rather than log recovery and hinder the ability to test the latter. The device-mapper dm-flakey module uses an up/down timer to define a cycle for when to fail I/Os. Create a pre log recovery delay tunable that can be used to coordinate XFS log recovery with I/O errors simulated by dm-flakey. This facilitates coordination in userspace that allows the reset of stale log blocks to succeed and writes due to log recovery to fail. For example, define a dm-flakey instance with an uptime long enough to allow log reset to succeed and a log recovery delay long enough to allow the dm-flakey uptime to expire. The 'log_recovery_delay' sysfs tunable is exported under /sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug mode. The value is exported in units of seconds and allows for a delay of up to 60 seconds. Note that this is for XFS debug and test instrumentation purposes only and should not be used by applications. No delay is enabled by default. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: add debug sysfs attribute setBrian Foster
Create a top-level debug directory for global debug sysfs attributes. This directory is added and removed on XFS module initialization and removal respectively for DEBUG mode kernels only. It typically resides at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs hierarchy as attributes might define global behavior or behavior that must be configured before an xfs mount is available (e.g., log recovery behavior). Define the global debug kobject that represents the debug sysfs directory and add generic attribute show/store helpers to support future attributes. No debug attributes are exported as of yet. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: add a few more verifier testsEric Sandeen
These were exposed by fsfuzzer runs; without them we fail in various exciting and sometimes convoluted ways when we encounter disk corruption. Without the MAXLEVELS tests we tend to walk off the end of an array in a loop like this: for (i = 0; i < cur->bc_nlevels; i++) { if (cur->bc_bufs[i]) Without the dirblklog test we try to allocate more memory than we could possibly hope for and loop forever: xfs_dabuf_map() nfsb = mp->m_dir_geo->fsbcount; irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP... As for the logbsize check, that's the convoluted one. If logbsize is specified at mount time, it's sanitized in xfs_parseargs; in particular it makes sure that it's not > XLOG_MAX_RECORD_BSIZE. If not specified at mount time, it comes from the superblock via sb_logsunit; this is limited to 256k at mkfs time as well; it's copied into m_logbsize in xfs_finish_flags(). However, if for some reason the on-disk value is corrupt and too large, nothing catches it. It's a circuitous path, but that size eventually finds its way to places that make the kernel very unhappy, leading to oopses in xlog_pack_data() because we use the size as an index into iclog->ic_data, but the array is not necessarily that big. Anyway - bounds checking when we read from disk is a good thing! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09xfs: mark all internal workqueues as freezableBrian Foster
Workqueues must be explicitly set as freezable to ensure they are frozen in the assocated part of the hibernation/suspend sequence. Freezing of workqueues and kernel threads is important to ensure that modifications are not made on-disk after the hibernation image has been created. Otherwise, the in-memory state can become inconsistent with what is on disk and eventually lead to filesystem corruption. We have reports of free space btree corruptions that occur immediately after restore from hibernate that suggest the xfs-eofblocks workqueue could be causing such problems if it races with hibernation. Mark all of the internal XFS workqueues as freezable to ensure nothing changes on-disk once the freezer infrastructure freezes kernel threads and creates the hibernation image. Signed-off-by: Brian Foster <bfoster@redhat.com> Reported-by: Carlos E. R. <carlos.e.r@opensuse.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-08block, bdi: an active gendisk always has a request_queue associated with itTejun Heo
bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-02xfs: trim eofblocks before collapse rangeBrian Foster
xfs_collapse_file_space() currently writes back the entire file undergoing collapse range to settle things down for the extent shift algorithm. While this prevents changes to the extent list during the collapse operation, the writeback itself is not enough to prevent unnecessary collapse failures. The current shift algorithm uses the extent index to iterate the in-core extent list. If a post-eof delalloc extent persists after the writeback (e.g., a prior zero range op where the end of the range aligns with eof can separate the post-eof blocks such that they are not written back and converted), xfs_bmap_shift_extents() becomes confused over the encoded br_startblock value and fails the collapse. As with the full writeback, this is a temporary fix until the algorithm is improved to cope with a volatile extent list and avoid attempts to shift post-eof extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: xfs_file_collapse_range is delalloc challengedDave Chinner
If we have delalloc extents on a file before we run a collapse range opertaion, we sync the range that we are going to collapse to convert delalloc extents in that region to real extents to simplify the shift operation. However, the shift operation then assumes that the extent list is not going to change as it iterates over the extent list moving things about. Unfortunately, this isn't true because we can't hold the ILOCK over all the operations. We can prevent new IO from modifying the extent list by holding the IOLOCK, but that doesn't prevent writeback from running.... And when writeback runs, it can convert delalloc extents is the range of the file prior to the region being collapsed, and this changes the indexes of all the extents in the file. That causes the collapse range operation to Go Bad. The right fix is to rewrite the extent shift operation not to be dependent on the extent list not changing across the entire operation, but this is a fairly significant piece of work to do. Hence, as a short-term workaround for the problem, sync the entire file before starting a collapse operation to remove all delalloc ranges from the file and so avoid the problem of concurrent writeback changing the extent list. Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't log inode unless extent shift makes extent modificationsBrian Foster
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse all subsequent extents down into the specified, previously punched out, region. This function performs some validation, such as whether a sufficient hole exists in the target region of the collapse, then shifts the remaining exents downward. The exit path of the function currently logs the inode unconditionally. While we must log the inode (and abort) if an error occurs and the transaction is dirty, the initial validation paths can generate errors before the transaction has been dirtied. This creates an unnecessary filesystem shutdown scenario, as the caller will cancel a transaction that has been marked dirty. Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications are made to the inode bmap. Only log the inode in the exit path if logflags has been set. This ensures we only have to cancel a dirty transaction if modifications have been made and prevents an unnecessary filesystem shutdown otherwise. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: use ranged writeback and invalidation for direct IODave Chinner
Now we are not doing silly things with dirtying buffers beyond EOF and using invalidation correctly, we can finally reduce the ranges of writeback and invalidation used by direct IO to match that of the IO being issued. Bring the writeback and invalidation ranges back to match the generic direct IO code - this will greatly reduce the perturbation of cached data when direct IO and buffered IO are mixed, but still provide the same buffered vs direct IO coherency behaviour we currently have. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't zero partial page cache pages during O_DIRECT writesDave Chinner
Similar to direct IO reads, direct IO writes are using truncate_pagecache_range to invalidate the page cache. This is incorrect due to the sub-block zeroing in the page cache that truncate_pagecache_range() triggers. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. cc: stable@vger.kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't zero partial page cache pages during O_DIRECT writesChris Mason
xfs is using truncate_pagecache_range to invalidate the page cache during DIO reads. This is different from the other filesystems who only invalidate pages during DIO writes. truncate_pagecache_range is meant to be used when we are freeing the underlying data structs from disk, so it will zero any partial ranges in the page. This means a DIO read can zero out part of the page cache page, and it is possible the page will stay in cache. buffered reads will find an up to date page with zeros instead of the data actually on disk. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. [dchinner: catch error and warn if it fails. Comment.] cc: stable@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't dirty buffers beyond EOFDave Chinner
generic/263 is failing fsx at this point with a page spanning EOF that cannot be invalidated. The operations are: 1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes) 1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes) 1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes) where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO write attempts to invalidate the cached page over this range, it fails with -EBUSY and so any attempt to do page invalidation fails. The real question is this: Why can't that page be invalidated after it has been written to disk and cleaned? Well, there's data on the first two buffers in the page (1k block size, 4k page), but the third buffer on the page (i.e. beyond EOF) is failing drop_buffers because it's bh->b_state == 0x3, which is BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say what? OK, set_buffer_dirty() is called on all buffers from __set_page_buffers_dirty(), regardless of whether the buffer is beyond EOF or not, which means that when we get to ->writepage, we have buffers marked dirty beyond EOF that we need to clean. So, we need to implement our own .set_page_dirty method that doesn't dirty buffers beyond EOF. This is messy because the buffer code is not meant to be shared and it has interesting locking issues on the buffer dirty bits. So just copy and paste it and then modify it to suit what we need. Note: the solutions the other filesystems and generic block code use of marking the buffers clean in ->writepage does not work for XFS. It still leaves dirty buffers beyond EOF and invalidations still fail. Hence rather than play whack-a-mole, this patch simply prevents those buffers from being dirtied in the first place. cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04Merge branch 'xfs-misc-fixes-3.17-2' into for-nextDave Chinner
2014-08-04Merge branch 'xfs-bulkstat-refactor' into for-nextDave Chinner
2014-08-04Merge branch 'xfs-misc-fixes-3.17-1' into for-nextDave Chinner
2014-08-04Merge branch 'xfs-quota-eofblocks-scan' into for-nextDave Chinner
2014-08-04xfs: fix coccinelle warningskbuild test robot
Removes unneeded semicolon, introduced by commit a70a4fa5 ("xfs: fix a couple error sequence jumps in xfs_mountfs"): fs/xfs/xfs_mount.c:858:24-25: Unneeded semicolon Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: flush both inodes in xfs_swap_extentsDave Chinner
We need to treat both inodes identically from a page cache point of view when prepareing them for extent swapping. We don't do this right now - we assume that one of the inodes empty, because that's what xfs_fsr currently does. Remove this assumption from the code. While factoring out the flushing and related checks, move the transactions reservation to immeidately after the flushes so that we don't need to pick up and then drop the ilock to do the transaction reservation. There are no issues with aborting the transaction it if the checks fail before we join the inodes to the transaction and dirty them, so this is a safe change to make. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: fix swapext ilock deadlockDave Chinner
xfs_swap_extents() holds the ilock over a call to filemap_write_and_wait(), which can then try to write data and take the ilock. That causes a self-deadlock. Fix the deadlock and clean up the code by separating the locking appropriately. Add a lockflags variable to track what locks we are holding as we gain and drop them and cleanup the error handling to always use "out_unlock" with the lockflags variable. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: kill xfs_vnode.hDave Chinner
Move the IO flag definitions to xfs_inode.h and kill the header file as it is now empty. Removing the xfs_vnode.h file showed up an implicit header include path: xfs_linux.h -> xfs_vnode.h -> xfs_fs.h And so every xfs header file has been inplicitly been including xfs_fs.h where it is needed or not. Hence the removal of xfs_vnode.h causes all sorts of build issues because BBTOB() and friends are no longer automatically included in the build. This also gets fixed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: kill VN_MAPPEDDave Chinner
Only one user, no longer needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: kill VN_CACHEDDave Chinner
Only has 2 users, has outlived it's usefulness. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: kill VN_DIRTY()Dave Chinner
Only one user of the macro and the dirty mapping check is redundant so just get rid of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: dquot recovery needs verifiersDave Chinner
dquot recovery should add verifiers to the dquot buffers that it recovers changes into. Unfortunately, it doesn't attached the verifiers to the buffers in a consistent manner. For example, xlog_recover_dquot_pass2() reads dquot buffers without a verifier and then writes it without ever having attached a verifier to the buffer. Further, dquot buffer recovery may write a dquot buffer that has not been modified, or indeed, shoul dbe written because quotas are not enabled and hence changes to the buffer were not replayed. In this case, we again write buffers without verifiers attached because that doesn't happen until after the buffer changes have been replayed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: quotacheck leaves dquot buffers without verifiersDave Chinner
When running xfs/305, I noticed that quotacheck was flushing dquot buffers that did not have the xfs_dquot_buf_ops verifiers attached: XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8 ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00 DQ....e......... ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001 ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000 ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80 Call Trace: [<ffffffff81cf1cca>] dump_stack+0x45/0x56 [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0 [<ffffffff810db520>] ? wake_up_state+0x20/0x20 [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0 [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0 [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0 [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220 [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90 [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90 [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0 [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0 [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0 [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340 [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0 [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170 [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20 [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0 [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120 [<ffffffff811e757e>] do_mount+0x23e/0xad0 [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50 [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150 [<ffffffff811e8103>] SyS_mount+0x83/0xc0 [<ffffffff81cfd40b>] tracesys+0xdd/0xe2 This was caused by dquot buffer readahead not attaching a verifier structure to the buffer when readahead was issued, resulting in the followup read of the buffer finding a valid buffer and so not attaching new verifiers to the buffer as part of the read. Also, when a verifier failure occurs, we then read the buffer without verifiers. Attach the verifiers manually after this read so that if the buffer is then written it will be verified that the corruption has been repaired. Further, when flushing a dquot we don't ask for a verifier when reading in the dquot buffer the dquot belongs to. Most of the time this isn't an issue because the buffer is still cached, but when it is not cached it will result in writing the dquot buffer without having the verfier attached. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: ensure verifiers are attached to recovered buffersDave Chinner
Crash testing of CRC enabled filesystems has resulted in a number of reports of bad CRCs being detected after the filesystem was mounted. Errors such as the following were being seen: XFS (sdb3): Mounting V5 Filesystem XFS (sdb3): Starting recovery (logdev: internal) XFS (sdb3): Metadata CRC error detected at xfs_agf_read_verify+0x5a/0x100 [xfs], block 0x1 XFS (sdb3): Unmount and run xfs_repair XFS (sdb3): First 64 bytes of corrupted metadata buffer: ffff880136ffd600: 58 41 47 46 00 00 00 01 00 00 00 00 00 0f aa 40 XAGF...........@ ffff880136ffd610: 00 02 6d 53 00 02 77 f8 00 00 00 00 00 00 00 01 ..mS..w......... ffff880136ffd620: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 03 ................ ffff880136ffd630: 00 00 00 04 00 08 81 d0 00 08 81 a7 00 00 00 00 ................ XFS (sdb3): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 74 numblks 1 The errors were typically being seen in AGF, AGI and their related btree block buffers some time after log recovery had run. Often it wasn't until later subsequent mounts that the problem was discovered. The common symptom was a buffer with the correct contents, but a CRC and an LSN that matched an older version of the contents. Some debug added to _xfs_buf_ioapply() indicated that buffers were being written without verifiers attached to them from log recovery, and Jan Kara isolated the cause to log recovery readahead an dit's interactions with buffers that had a more recent LSN on disk than the transaction being recovered. In this case, the buffer did not get a verifier attached, and os when the second phase of log recovery ran and recovered EFIs and unlinked inodes, the buffers were modified and written without the verifier running. Hence they had up to date contents, but stale LSNs and CRCs. Fix it by attaching verifiers to buffers we skip due to future LSN values so they don't escape into the buffer cache without the correct verifier attached. This patch is based on analysis and a patch from Jan Kara. cc: <stable@vger.kernel.org> Reported-by: Jan Kara <jack@suse.cz> Reported-by: Fanael Linithien <fanael4@gmail.com> Reported-by: Grozdan <neutrino8@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: catch buffers written without verifiers attachedDave Chinner
We recently had a bug where buffers were slipping through log recovery without any verifier attached to them. This was resulting in on-disk CRC mismatches for valid data. Add some warning code to catch this occurrence so that we catch such bugs during development rather than not being aware they exist. Note that we cannot do this verification unconditionally as non-CRC filesystems don't always attach verifiers to the buffers being written. e.g. during log recovery we cannot identify all the different types of buffers correctly on non-CRC filesystems, so we can't attach the correct verifiers in all cases and so we don't attach any. Hence we don't want on non-CRC filesystems to avoid spamming the logs with false indications. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: avoid false quotacheck after unclean shutdownEric Sandeen
The commit 83e782e xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD added a new function xfs_sb_quota_from_disk() which swaps on-disk XFS_OQUOTA_* flags for in-core XFS_GQUOTA_* and XFS_PQUOTA_* flags after the superblock is read. However, if log recovery is required, the superblock is read again, and the modified in-core flags are re-read from disk, so we have XFS_OQUOTA_* flags in memory again. This causes the XFS_QM_NEED_QUOTACHECK() test to be true, because the XFS_OQUOTA_CHKD is still set, and not XFS_GQUOTA_CHKD or XFS_PQUOTA_CHKD. Change xfs_sb_from_disk to call xfs_sb_quota_from disk and always convert the disk flags to in-memory flags. Add a lower-level function which can be called with "false" to not convert the flags, so that the sb verifier can verify exactly what was on disk, per Brian Foster's suggestion. Reported-by: Cyril B. <cbay@excellency.fr> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2014-08-04xfs: fix rounding error of fiemap length parameterBrian Foster
The offset and length parameters are converted from bytes to basic blocks by xfs_vn_fiemap(). The BTOBB() converter rounds the value up to the nearest basic block. This leads to unexpected behavior when unaligned offsets are provided to FIEMAP. Fix the conversions of byte values to block values to cover the provided offsets. Round down the start offset to the nearest basic block. Calculate the end offset based on the provided values, round up and calculate length based on the start block offset. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04xfs: introduce xfs_bulkstat_ag_ichunkJie Liu
Introduce xfs_bulkstat_ag_ichunk() to process inodes in chunk with a pointer to a formatter function that will iget the inode and fill in the appropriate structure. Refactor xfs_bulkstat() with it. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-30xfs: require 64-bit sector_tChristoph Hellwig
Trying to support tiny disks only and saving a bit memory might have made sense on an SGI O2 15 years ago, but is pretty pointless today. Remove the rarely tested codepath that uses various smaller in-memory types to reduce our test matrix and make the codebase a little bit smaller and less complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: fix uflags detection at xfs_fs_rm_xquotaJie Liu
We are intended to check up uflags against FS_PROJ_QUOTA rather than FS_USER_UQUOTA once more, it looks to me like a typo, but might cause the project quota metadata space can not be removed. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: remove XFS_IS_OQUOTA_ON macrosJie Liu
Remove the XFS_IS_OQUOTA_ON macros as it is obsoleted. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: tidy up xfs_set_inode32Eric Sandeen
xfs_set_inode32() caught my eye because it had weird spacing around the "-1's". In cleaning that up, I realized that the assignment in the declaration of "ino" is never used; it's rewritten before it gets read. Drop the ino initializer from its declaration since it's not used, and move the agino initialization into the body of the function, mostly so that we can have pretty whitespace and not exceed 80 columns. :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: allow inode allocations in post-growfs disk spaceEric Sandeen
Today, if we perform an xfs_growfs which adds allocation groups, mp->m_maxagi is not properly updated when the growfs is complete. Therefore inodes will continue to be allocated only in the AGs which existed prior to the growfs, and the new space won't be utilized. This is because of this path in xfs_growfs_data_private(): xfs_growfs_data_private xfs_initialize_perag(mp, nagcount, &nagimax); if (mp->m_flags & XFS_MOUNT_32BITINODES) index = xfs_set_inode32(mp); else index = xfs_set_inode64(mp); if (maxagi) *maxagi = index; where xfs_set_inode* iterates over the (old) agcount in mp->m_sb.sb_agblocks, which has not yet been updated in the growfs path. So "index" will be returned based on the old agcount, not the new one, and new AGs are not available for inode allocation. Fix this by explicitly passing the proper AG count (which xfs_initialize_perag() already has) down another level, so that xfs_set_inode* can make the proper decision about acceptable AGs for inode allocation in the potentially newly-added AGs. This has been broken since 3.7, when these two xfs_set_inode* functions were added in commit 2d2194f. Prior to that, we looped over "agcount" not sb_agblocks in these calculations. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: mark xfs_qm_quotacheck as staticJie Liu
xfs_qm_quotacheck() is not used outside of xfs_qm.c. Mark it static and move it around in the file to avoid a forward declaration. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: fix cil push sequence after log recoveryMark Tinguely
When the CIL checkpoint is fully written to the log, the LSN of the checkpoint commit record is written into the CIL context structure. This allows log force waiters to correctly detect when the checkpoint they are waiting on have been fully written into the log buffers. However, the initial context after mount is initialised with a non-zero commit LSN, so appears to waiters as though it is complete even though it may not have even been pushed, let alone written to the log buffers. Hence a log force immediately after a filesystem is mounted may not behave correctly, nor does commit record ordering if multiple CIL pushes interleave immediately after mount. To fix this, make sure the initial context commit LSN is not touched until the first checkpointis actually pushed. [dchinner: rewrite commit message] Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: squash prealloc while over quota free space as wellBrian Foster
From: Brian Foster <bfoster@redhat.com> Commit 4d559a3b introduced heavy prealloc. squashing to catch the case of requesting too large a prealloc on smaller filesystems, leading to repeated flush and retry cycles that occur on ENOSPC. Now that we issue eofblocks scans on EDQUOT/ENOSPC, squash the prealloc against the minimum available free space across all applicable quotas as well to avoid a similar problem of repeated eofblocks scans. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24xfs: run an eofblocks scan on ENOSPC/EDQUOTBrian Foster
From: Brian Foster <bfoster@redhat.com> Speculative preallocation and and the associated throttling metrics assume we're working with large files on large filesystems. Users have reported inefficiencies in these mechanisms when we happen to be dealing with large files on smaller filesystems. This can occur because while prealloc throttling is aggressive under low free space conditions, it is not active until we reach 5% free space or less. For example, a 40GB filesystem has enough space for several files large enough to have multi-GB preallocations at any given time. If those files are slow growing, they might reserve preallocation for long periods of time as well as avoid the background scanner due to frequent modification. If a new file is written under these conditions, said file has no access to this already reserved space and premature ENOSPC is imminent. To handle this scenario, modify the buffered write ENOSPC handling and retry sequence to invoke an eofblocks scan. In the smaller filesystem scenario, the eofblocks scan resets the usage of preallocation such that when the 5% free space threshold is met, throttling effectively takes over to provide fair and efficient preallocation until legitimate ENOSPC. The eofblocks scan is selective based on the nature of the failure. For example, an EDQUOT failure in a particular quota will use a filtered scan for that quota. Because we don't know which quota might have caused an allocation failure at any given time, we include each applicable quota determined to be under low free space conditions in the scan. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>