summaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_bmap.c
AgeCommit message (Collapse)Author
2014-04-17xfs: collapse range is delalloc challengedDave Chinner
FSX has been detecting data corruption after to collapse range calls. The key observation is that the offset of the last extent in the file was not being shifted, and hence when the file size was adjusted it was truncating away data because the extents handled been correctly shifted. Tracing indicated that before the collapse, the extent list looked like: .... ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 and after the shift of 2 blocks: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 Note that the last extent did not change offset. After the changing of the file size: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0 You can see that the last extent had it's length truncated, indicating that we've lost data. The reason for this is that the xfs_bmap_shift_extents() loop uses XFS_IFORK_NEXTENTS() to determine how many extents are in the inode. This, unfortunately, doesn't take into account delayed allocation extents - it's a count of physically allocated extents - and hence when the file being collapsed has a delalloc extent like this one does prior to the range being collapsed: .... ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0 .... it gets the count wrong and terminates the shift loop early. Fix it by using the in-memory extent array size that includes delayed allocation extents to determine the number of extents on the inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-24xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocateNamjae Jeon
This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for XFS. The semantics of this flag are following: 1) It collapses the range lying between offset and length by removing any data blocks which are present in this range and than updates all the logical offsets of extents beyond "offset + len" to nullify the hole created by removing blocks. In short, it does not leave a hole. 2) It should be used exclusively. No other fallocate flag in combination. 3) Offset and length supplied to fallocate should be fs block size aligned in case of xfs and ext4. 4) Collaspe range does not work beyond i_size. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-01-09Merge branch 'xfs-extent-list-locking-fixes' into for-nextBen Myers
A set of fixes which makes sure we are taking the ilock whenever accessing the extent list. This was associated with "Access to block zero" messages which may result in extent list corruption.
2013-12-18xfs: assert that we hold the ilock for extent map accessChristoph Hellwig
Make sure that xfs_bmapi_read has the ilock held in some way, and that xfs_bmapi_write, xfs_bmapi_delay, xfs_bunmapi and xfs_iread_extents are called with the ilock held exclusively. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: swalloc doesn't align allocations properlyDave Chinner
When swalloc is specified as a mount option, allocations are supposed to be aligned to the stripe width rather than the stripe unit of the underlying filesystem. However, it does not do this. What the implementation does is round up the allocation size to a stripe width, hence ensuring that all allocations span a full stripe width. It does not, however, ensure that that allocation is aligned to a stripe width, and hence the allocations can span multiple underlying stripes and so still see RMW cycles for things like direct IO on MD RAID. So, if the swalloc mount option is set, change the allocation alignment in xfs_bmap_btalloc() to use the stripe width rather than the stripe unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: align initial file allocations correctlyDave Chinner
The function xfs_bmap_isaeof() is used to indicate that an allocation is occurring at or past the end of file, and as such should be aligned to the underlying storage geometry if possible. Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the behaviour of this function for empty files - it turned off allocation alignment for this case accidentally. Hence large initial allocations from direct IO are not getting correctly aligned to the underlying geometry, and that is cause write performance to drop in alignment sensitive configurations. Fix it by considering allocation into empty files as requiring aligned allocation again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)
2013-12-11xfs: align initial file allocations correctlyDave Chinner
The function xfs_bmap_isaeof() is used to indicate that an allocation is occurring at or past the end of file, and as such should be aligned to the underlying storage geometry if possible. Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the behaviour of this function for empty files - it turned off allocation alignment for this case accidentally. Hence large initial allocations from direct IO are not getting correctly aligned to the underlying geometry, and that is cause write performance to drop in alignment sensitive configurations. Fix it by considering allocation into empty files as requiring aligned allocation again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-11-18xfs: fix unlock in xfs_bmap_add_attrforkMark Tinguely
xfs_trans_ijoin() activates the inode in a transaction and also can specify which lock to free when the transaction is committed or canceled. xfs_bmap_add_attrfork call locks and adds the lock to the transaction but also manually removes the lock. Change the routine to not add the lock to the transaction and manually remove lock on completion. While here, clean up the xfs_trans_cancel flags and goto names. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: decouple inode and bmap btree header filesDave Chinner
Currently the xfs_inode.h header has a dependency on the definition of the BMAP btree records as the inode fork includes an array of xfs_bmbt_rec_host_t objects in it's definition. Move all the btree format definitions from xfs_btree.h, xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to xfs_format.h to continue the process of centralising the on-disk format definitions. With this done, the xfs inode definitions are no longer dependent on btree header files. The enables a massive culling of unnecessary includes, with close to 200 #include directives removed from the XFS kernel code base. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: decouple log and transaction headersDave Chinner
xfs_trans.h has a dependency on xfs_log.h for a couple of structures. Most code that does transactions doesn't need to know anything about the log, but this dependency means that they have to include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header files and clean up the includes to be in dependency order. In doing this, remove the direct include of xfs_trans_reserve.h from xfs_trans.h so that we remove the dependency between xfs_trans.h and xfs_mount.h. Hence the xfs_trans.h include can be moved to the indicate the actual dependencies other header files have on it. Note that these are kernel only header files, so this does not translate to any userspace changes at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: unify directory/attribute format definitionsDave Chinner
The on-disk format definitions for the directory and attribute structures are spread across 3 header files right now, only one of which is dedicated to defining on-disk structures and their manipulation (xfs_dir2_format.h). Pull all the format definitions into a single header file - xfs_da_format.h - and switch all the code over to point at that. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: create a shared header file for format-related informationDave Chinner
All of the buffer operations structures are needed to be exported for xfs_db, so move them all to a common location rather than spreading them all over the place. They are verifying the on-disk format, so while xfs_format.h might be a good place, it is not part of the on disk format. Hence we need to create a new header file that we centralise these related definitions. Start by moving the bffer operations structures, and then also move all the other definitions that have crept into xfs_log_format.h and xfs_format.h as there was no other shared header file to put them in. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17xfs: remove newlines from strings passed to __xfs_printkEric Sandeen
__xfs_printk adds its own "\n". Having it in the original string leads to unintentional blank lines from these messages. Most format strings have no newline, but a few do, leading to i.e.: [ 7347.119911] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119911] [ 7347.119919] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119919] Fix them all. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-09xfs: fix some minor sparse warningsDave Chinner
A couple of simple locking annotations and 0 vs NULL warnings. Nothing that changes any code behaviour, just removes build noise. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20xfs: fix the comment of xfs_bmap_last_before()Zhi Yong Wu
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20xfs: fix the comment of xfs_bmap_validate_ret()Zhi Yong Wu
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: refactor xfs_trans_reserve() interfaceJie Liu
With the new xfs_trans_res structure has been introduced, the log reservation size, log count as well as log flags are pre-initialized at mount time. So it's time to refine xfs_trans_reserve() interface to be more neat. Also, introduce a new helper M_RES() to return a pointer to the mp->m_resv structure to simplify the input. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: kill xfs_vnodeops.[ch]Dave Chinner
Now we have xfs_inode.c for holding kernel-only XFS inode operations, move all the inode operations from xfs_vnodeops.c to this new file as it holds another set of kernel-only inode operations. The name of this file traces back to the days of Irix and it's vnodes which we don't have anymore. Essentially this move consolidates the inode locking functions and a bunch of XFS inode operations into the one file. Eventually the high level functions will be merged into the VFS interface functions in xfs_iops.c. This leaves only internal preallocation, EOF block manipulation and hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c where we are already consolidating various in-kernel physical extent manipulation and querying functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: fix issues that cause userspace warningsDave Chinner
Some of the code shared with userspace causes compilation warnings from things turned off in the kernel code, such as differences in variable signedness. Fix those issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: create xfs_bmap_util.[ch]Dave Chinner
There is a bunch of code in xfs_bmap.c that is kernel specific and not shared with userspace. To minimise the difference between the kernel and userspace code, shift this unshared code to xfs_bmap_util.c, and the declarations to xfs_bmap_util.h. The biggest issue here is xfs_bmap_finish() - userspace has it's own definition of this function, and so we need to move it out of xfs_bmap.[ch]. This means several other files need to include xfs_bmap_util.h as well. It also introduces and interesting dance for the stack switching code in xfs_bmapi_allocate(). The stack switching/workqueue code is actually moved to xfs_bmap_util.c, so that userspace can simply use a #define in a header file to connect the dots without needing to know about the stack switch code at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: reshuffle dir2 definitions around for userspaceDave Chinner
Many of the definitions within xfs_dir2_priv.h are needed in userspace outside libxfs. Definitions within xfs_dir2_priv.h are wholly contained within libxfs, so we need to shuffle some of the definitions around to keep consistency across files shared between user and kernel space. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12xfs: separate dquot on disk format definitions out of xfs_quota.hDave Chinner
The on disk format definitions of the on-disk dquot, log formats and quota off log formats are all intertwined with other definitions for quotas. Separate them out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-09xfs: remove local fork format handling from xfs_bmapi_write()Dave Chinner
The conversion from local format to extent format requires interpretation of the data in the fork being converted, so it cannot be done in a generic way. It is up to the caller to convert the fork format to extent format before calling into xfs_bmapi_write() so format conversion can be done correctly. The code in xfs_bmapi_write() to convert the format is used implicitly by the attribute and directory code, but they specifically zero the fork size so that the conversion does not do any allocation or manipulation. Move this conversion into the shortform to leaf functions for the dir/attr code so the conversions are explicitly controlled by all callers. Now we can remove the conversion code in xfs_bmapi_write. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-27xfs: buffer type overruns blf_flags fieldDave Chinner
The buffer type passed to log recvoery in the buffer log item overruns the blf_flags field. I had assumed that flags field was a 32 bit value, and it turns out it is a unisgned short. Therefore having 19 flags doesn't really work. Convert the buffer type field to numeric value, and use the top 5 bits of the flags field for it. We currently have 17 types of buffers, so using 5 bits gives us plenty of room for expansion in future.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-27xfs: add CRCs to dir2/da node blocksDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-21xfs: split out symlink code into it's own file.Dave Chinner
The symlink code is about to get more complicated when CRCs are added for remote symlink blocks. The symlink management code is mostly self contained, so move it to it's own files so that all the new code and the existing symlink code will not be intermingled with other unrelated code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-21xfs: add support for large btree blocksChristoph Hellwig
Add support for larger btree blocks that contains a CRC32C checksum, a filesystem uuid and block number for detecting filesystem consistency and out of place writes. [dchinner@redhat.com] Also include an owner field to allow reverse mappings to be implemented for improved repairability and a LSN field to so that log recovery can easily determine the last modification that made it to disk for each buffer. [dchinner@redhat.com] Add buffer log format flags to indicate the type of buffer to recovery so that we don't have to do blind magic number tests to determine what the buffer is. [dchinner@redhat.com] Modified to fit into the verifier structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-03-14xfs: take inode version into account in XFS_LITINOChristoph Hellwig
Add a version argument to XFS_LITINO so that it can return different values depending on the inode version. This is required for the upcoming v3 inodes with a larger fixed layout dinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-03-07xfs: rearrange some code in xfs_bmap for better localityDave Chinner
xfs_bmap.c is a big file, and some of the related code is spread all throughout the file requiring function prototypes for static function and jumping all through the file to follow a single call path. Rearrange the code so that: a) related functionality is grouped together; and b) functions are grouped in call dependency order While the diffstat is large, there are no code changes in the patch; it is just moving the functionality around and removing the function prototypes at the top of the file. The resulting layout of the code is as follows (top of file to bottom): - miscellaneous helper functions - extent tree block counting routines - debug/sanity checking code - bmap free list manipulation functions - inode fork format manipulation functions - internal/external extent tree seach functions - extent tree manipulation functions used during allocation - functions used during extent read/allocate/removal operations (i.e. xfs_bmapi_write, xfs_bmapi_read, xfs_bunmapi and xfs_getbmap) This means that following logic paths through the bmapi code is much simpler - most of the code relevant to a specific operation is now clustered together rather than spread all over the file.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-14xfs: xfs_bmap_add_attrfork_local is too genericDave Chinner
When we are converting local data to an extent format as a result of adding an attribute, the type of data contained in the local fork determines the behaviour that needs to occur. xfs_bmap_add_attrfork_local() already handles the directory data case specially by using S_ISDIR() and calling out to xfs_dir2_sf_to_block(), but with verifiers we now need to handle each different type of metadata specially and different metadata formats require different verifiers (and eventually block header initialisation). There is only a single place that we add and attribute fork to the inode, but that is in the attribute code and it knows nothing about the specific contents of the data fork. It is only the case of local data that is the issue here, so adding code to hadnle this case in the attribute specific code is wrong. Hence we are really stuck trying to detect the data fork contents in xfs_bmap_add_attrfork_local() and performing the correct callout there. Luckily the current cases can be determined by S_IS* macros, and we can push the work off to data specific callouts, but each of those callouts does a lot of work in common with xfs_bmap_local_to_extents(). The only reason that this fails for symlinks right now is is that xfs_bmap_local_to_extents() assumes the data fork contains extent data, and so attaches a a bmap extent data verifier to the buffer and simply copies the data fork information straight into it. To fix this, allow us to pass a "formatting" callback into xfs_bmap_local_to_extents() which is responsible for setting the buffer type, initialising it and copying the data fork contents over to the new buffer. This allows callers to specify how they want to format the new buffer (which is necessary for the upcoming CRC enabled metadata blocks) and hence make xfs_bmap_local_to_extents() useful for any type of data fork content. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-17xfs: pull up stack_switch check into xfs_bmapi_writeBrian Foster
The stack_switch check currently occurs in __xfs_bmapi_allocate, which means the stack switch only occurs when xfs_bmapi_allocate() is called in a loop. Pull the check up before the loop in xfs_bmapi_write() such that the first iteration of the loop has consistent behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-03xfs: don't zero structure members after a memset(0)Eric Sandeen
Commit 408cc4e97a3ccd172d2d676e4b585badf439271b added memset(0, ...) to allocation args structures, so there is no need to explicitly set any of the fields to 0 after that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-15xfs: convert buffer verifiers to an ops structure.Dave Chinner
To separate the verifiers from iodone functions and associate read and write verifiers at the same time, introduce a buffer verifier operations structure to the xfs_buf. This avoids the need for assigning the write verifier, clearing the iodone function and re-running ioend processing in the read verifier, and gets rid of the nasty "b_pre_io" name for the write verifier function pointer. If we ever need to, it will also be easier to add further content specific callbacks to a buffer with an ops structure in place. We also avoid needing to export verifier functions, instead we can simply export the ops structures for those that are needed outside the function they are defined in. This patch also fixes a directory block readahead verifier issue it exposed. This patch also adds ops callbacks to the inode/alloc btree blocks initialised by growfs. These will need more work before they will work with CRCs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-15xfs: connect up write verifiers to new buffersDave Chinner
Metadata buffers that are read from disk have write verifiers already attached to them, but newly allocated buffers do not. Add appropriate write verifiers to all new metadata buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-15xfs: verify btree blocks as they are read from diskDave Chinner
Add an btree block verify callback function and pass it into the buffer read functions. Because each different btree block type requires different verification, add a function to the ops structure that is called from the generic code. Also, propagate the verification callback functions through the readahead functions, and into the external bmap and bulkstat inode readahead code that uses the generic btree buffer read functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Phil White <pwhite@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14xfs: remove xfs_flush_pagesDave Chinner
It is a complex wrapper around VFS functions, but there are VFS functions that provide exactly the same functionality. Call the VFS functions directly and remove the unnecessary indirection and complexity. We don't need to care about clearing the XFS_ITRUNCATED flag, as that is done during .writepages. Hence is cleared by the VFS writeback path if there is anything to write back during the flush. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18xfs: move allocation stack switch up to xfs_bmapi_allocateDave Chinner
Switching stacks are xfs_alloc_vextent can cause deadlocks when we run out of worker threads on the allocation workqueue. This can occur because xfs_bmap_btalloc can make multiple calls to xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can return with the AGF locked in the current allocation transaction. If we then need to make another allocation, and all the allocation worker contexts are exhausted because the are blocked waiting for the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent work completed to release the AGF. Hence allocation effectively deadlocks. To avoid this, move the stack switch one layer up to xfs_bmapi_allocate() so that all of the allocation attempts in a single switched stack transaction occur in a single worker context. This avoids the problem of an allocation being blocked waiting for a worker thread whilst holding the AGF. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18xfs: introduce XFS_BMAPI_STACK_SWITCHDave Chinner
Certain allocation paths through xfs_bmapi_write() are in situations where we have limited stack available. These are almost always in the buffered IO writeback path when convertion delayed allocation extents to real extents. The current stack switch occurs for userdata allocations, which means we also do stack switches for preallocation, direct IO and unwritten extent conversion, even those these call chains have never been implicated in a stack overrun. Hence, let's target just the single stack overun offended for stack switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that the caller can pass xfs_bmapi_write() to indicate it should switch stacks if it needs to do allocation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-10-18xfs: zero allocation_args on the kernel stackMark Tinguely
Zero the kernel stack space that makes up the xfs_alloc_arg structures. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-06-14xfs: make largest supported offset less shoutyDave Chinner
XFS_MAXIOFFSET() is just a simple macro that resolves to mp->m_maxioffset. It doesn't need to exist, and it just makes the code unnecessarily loud and shouty. Make it quiet and easy to read. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-21xfs: fix delalloc quota accounting on failureDave Chinner
xfstest 270 was causing quota reservations way beyond what was sane (ten to hundreds of TB) for a 4GB filesystem. There's a sign problem in the error handling path of xfs_bmapi_reserve_delalloc() because xfs_trans_unreserve_quota_nblks() simple negates the value passed - which doesn't work for an unsigned variable. This causes reservations of close to 2^32 block instead of removing a reservation of a handful of blocks. Fix the same problem in the other xfs_trans_unreserve_quota_nblks() callers where unsigned integer variables are used, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14xfs: move xfs_get_extsz_hint() and kill xfs_rw.hDave Chinner
The only thing left in xfs_rw.h is a function prototype for an inode function. Move that to xfs_inode.h, and kill xfs_rw.h. Also move the function implementing the prototype from xfs_rw.c to xfs_inode.c so we only have one function left in xfs_rw.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14xfs: move xfs_fsb_to_db to xfs_bmap.hDave Chinner
This is the only remaining useful function in xfs_rw.h, so move it to a header file responsible for block mapping functions that the callers already include. Soon we can get rid of xfs_rw.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-05-14xfs: don't assert on delalloc regions beyond EOFDave Chinner
When we are doing speculative delayed allocation beyond EOF, conversion of the region allocated beyond EOF is dependent on the largest free space extent available. If the largest free extent is smaller than the delalloc range, then after allocation we leave a delalloc extent that starts beyond EOF. This extent cannot *ever* be converted by flushing data, and so will remain there until either the EOF moves into the extent or it is truncated away. Hence if xfs_getbmap() runs on such an inode and is asked to return extents beyond EOF, it will assert fail on this extent even though there is nothing xfs_getbmap() can do to convert it to a real extent. Hence we should simply report these delalloc extents rather than assert that there should be none. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22xfs: fix deadlock in xfs_rtfree_extentKamal Dasu
To fix the deadlock caused by repeatedly calling xfs_rtfree_extent - removed xfs_ilock() and xfs_trans_ijoin() from xfs_rtfree_extent(), instead added asserts that the inode is locked and has an inode_item attached to it. - in xfs_bunmapi() when dealing with an inode with the rt flag call xfs_ilock() and xfs_trans_ijoin() so that the reference count is bumped on the inode and attached it to the transaction before calling into xfs_bmap_del_extent, similar to what we do in xfs_bmap_rtalloc. Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-15xfs: fallback to vmalloc for large buffers in xfs_getbmapDave Chinner
xfs_getbmap uses for a large buffer for extents, which is kmalloc'd. This can fail after the system has been running for some time as it is a high order allocation. Add a fallback to vmalloc so that it doesn't require contiguous memory and so won't randomly fail on files with large extent lists. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17xfs: remove the i_size field in struct xfs_inodeChristoph Hellwig
There is no fundamental need to keep an in-memory inode size copy in the XFS inode. We already have the on-disk value in the dinode, and the separate in-memory copy that we need for regular files only in the XFS inode. Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the VFS inode i_size field for regular files. Switch code that was directly accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where we are limited to regular files direct access of the VFS inode i_size field. This also allows dropping some fairly complicated code in the write path which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size that is getting updated inside ->write_end. Note that we do not bother resetting the VFS i_size when truncating a file that gets freed to zero as there is no point in doing so because the VFS inode is no longer in use at this point. Just relax the assert in xfs_ifree to only check the on-disk size instead. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-01-17xfs: remove the if_ext_max field in struct xfs_iforkChristoph Hellwig
We spent a lot of effort to maintain this field, but it always equals to the fork size divided by the constant size of an extent. The prime use of it is to assert that the two stay in sync. Just divide the fork size by the extent size in the few places that we actually use it and remove the overhead of maintaining it. Also introduce a few helpers to consolidate the places where we actually care about the value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2011-12-02xfs: fix allocation length overflow in xfs_bmapi_write()Dave Chinner
When testing the new xfstests --large-fs option that does very large file preallocations, this assert was tripped deep in xfs_alloc_vextent(): XFS: Assertion failed: args->minlen <= args->maxlen, file: fs/xfs/xfs_alloc.c, line: 2239 The allocation was trying to allocate a zero length extent because the lower 32 bits of the allocation length was zero. The remaining length of the allocation to be done was an exact multiple of 2^32 - the first case I saw was at 496TB remaining to be allocated. This turns out to be an overflow when converting the allocation length (a 64 bit quantity) into the extent length to allocate (a 32 bit quantity), and it requires the length to be allocated an exact multiple of 2^32 blocks to trip the assert. Fix it by limiting the extent lenth to allocate to MAXEXTLEN. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2011-10-11xfs: simplify xfs_trans_ijoin* againChristoph Hellwig
There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when commiting a synchronous transactions, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>