summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2013-06-06NFSv4.1: Enable state protectionTrond Myklebust
Use the EXCHGID4_FLAG_BIND_PRINC_STATEID exchange_id flag to enable stateid protection. This means that if we create a stateid using a particular principal, then we must use the same principal if we want to change that state. IOW: if we OPEN a file using a particular credential, then we have to use the same credential in subsequent OPEN_DOWNGRADE, CLOSE, or DELEGRETURN operations that use that stateid. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06NFSv4.1: Use layout credentials for get_deviceinfo callsTrond Myklebust
This is not strictly needed, since get_deviceinfo is not allowed to return NFS4ERR_ACCESS or NFS4ERR_WRONG_CRED, but lets do it anyway for consistency with other pNFS operations. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06NFSv4.1: Ensure that test_stateid and free_stateid use correct credentialsTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06NFSv4.1: Ensure that reclaim_complete uses the right credentialTrond Myklebust
We want to use the same credential for reclaim_complete as we used for the exchange_id call. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06NFSv4.1: Ensure that layoutreturn uses the correct credentialTrond Myklebust
We need to use the same credential as was used for the layoutget and/or layoutcommit operations. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06NFSv4.1: Ensure that layoutget is called using the layout credentialTrond Myklebust
Ensure that we use the same credential for layoutget, layoutcommit and layoutreturn. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06ext4: use ext4_da_writepages() for all modesTheodore Ts'o
Rename ext4_da_writepages() to ext4_writepages() and use it for all modes. We still need to iterate over all the pages in the case of data=journalling, but in the case of nodelalloc/data=ordered (which is what file systems mounted using ext3 backwards compatibility will use) this will allow us to use a much more efficient I/O submission path. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06xfs: increase number of ACL entries for V5 superblocksDave Chinner
The limit of 25 ACL entries is arbitrary, but baked into the on-disk format. For version 5 superblocks, increase it to the maximum nuber of ACLs that can fit into a single xattr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 5c87d4bc1a86bd6e6754ac3d6e111d776ddcfe57)
2013-06-06xfs: disable noattr2/attr2 mount options for CRC enabled filesystemsDave Chinner
attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit d3eaace84e40bf946129e516dcbd617173c1cf14)
2013-06-06xfs: inode unlinked list needs to recalculate the inode CRCDave Chinner
The inode unlinked list manipulations operate directly on the inode buffer, and so bypass the inode CRC calculation mechanisms. Hence an inode on the unlinked list has an invalid CRC. Fix this by recalculating the CRC whenever we modify an unlinked list pointer in an inode, ncluding during log recovery. This is trivial to do and results in unlinked list operations always leaving a consistent inode in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 0a32c26e720a8b38971d0685976f4a7d63f9e2ef)
2013-06-06xfs: fix log recovery transaction item reorderingDave Chinner
There are several constraints that inode allocation and unlink logging impose on log recovery. These all stem from the fact that inode alloc/unlink are logged in buffers, but all other inode changes are logged in inode items. Hence there are ordering constraints that recovery must follow to ensure the correct result occurs. As it turns out, this ordering has been working mostly by chance than good management. The existing code moves all buffers except cancelled buffers to the head of the list, and everything else to the tail of the list. The problem with this is that is interleaves inode items with the buffer cancellation items, and hence whether the inode item in an cancelled buffer gets replayed is essentially left to chance. Further, this ordering causes problems for log recovery when inode CRCs are enabled. It typically replays the inode unlink buffer long before it replays the inode core changes, and so the CRC recorded in an unlink buffer is going to be invalid and hence any attempt to validate the inode in the buffer is going to fail. Hence we really need to enforce the ordering that the inode alloc/unlink code has expected log recovery to have since inode chunk de-allocation was introduced back in 2003... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit a775ad778073d55744ed6709ccede36310638911)
2013-06-06xfs: fix remote attribute invalidation for a leafDave Chinner
When invalidating an attribute leaf block block, there might be remote attributes that it points to. With the recent rework of the remote attribute format, we have to make sure we calculate the length of the attribute correctly. We aren't doing that in xfs_attr3_leaf_inactive(), so fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 59913f14dfe8eb772ff93eb442947451b4416329)
2013-06-06xfs: rework dquot CRCsDave Chinner
Calculating dquot CRCs when the backing buffer is written back just doesn't work reliably. There are several places which manipulate dquots directly in the buffers, and they don't calculate CRCs appropriately, nor do they always set the buffer up to calculate CRCs appropriately. Firstly, if we log a dquot buffer (e.g. during allocation) it gets logged without valid CRC, and so on recovery we end up with a dquot that is not valid. Secondly, if we recover/repair a dquot, we don't have a verifier attached to the buffer and hence CRCs are not calculated on the way down to disk. Thirdly, calculating the CRC after we've changed the contents means that if we re-read the dquot from the buffer, we cannot verify the contents of the dquot are valid, as the CRC is invalid. So, to avoid all the dquot CRC errors that are being detected by the read verifier, change to using the same model as for inodes. That is, dquot CRCs are calculated and written to the backing buffer at the time the dquot is flushed to the backing buffer. If we modify the dquot directly in the backing buffer, calculate the CRC immediately after the modification is complete. Hence the dquot in the on-disk buffer should always have a valid CRC. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 6fcdc59de28817d1fbf1bd58cc01f4f3fac858fb)
2013-06-06ext4: optimize test_root()Theodore Ts'o
The test_root() function could potentially loop forever due to overflow issues. So rewrite test_root() to avoid this issue; as a bonus, it is 38% faster when benchmarked via a test loop: int main(int argc, char **argv) { int i; for (i = 0; i < 1 << 24; i++) { if (test_root(i, 7)) printf("%d\n", i); } } Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06ext4: add sanity check to ext4_get_group_info()Theodore Ts'o
The group number passed to ext4_get_group_info() should be valid, but let's add an assert to check this before we start creating a pointer based on that group number and dereferencing it. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06ext4: verify group number in verify_group_input() before using itTheodore Ts'o
Check the group number for sanity earilier, before calling routines such as ext4_bg_has_super() or ext4_group_overhead_blocks(). Reported-by: Jonathan Salwan <jonathan.salwan@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06ext4: add check to io_submit_init_bioTheodore Ts'o
The bio_alloc() function can return NULL if the memory allocation fails. So we need to check for this. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06GFS2: fix error propagation in init_threads()Alexey Khoroshilov
If kthread_run() fails, init_threads() returns IS_ERR(p) instead of PTR_ERR(p). Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-06f2fs: reorganise the function get_victim_by_defaultNamjae Jeon
Fix the function get_victim_by_default, where it checks for the condition that p.min_segno != NULL_SEGNO as shown: if (p.min_segno != NULL_SEGNO) goto got_it; and if above condition is true then got_it: if (p.min_segno != NULL_SEGNO) { So this condition is being checked twice. Hence move the goto statement after the if condition so that duplication of condition check is avoided. Also this function makes a call to get_max_cost() to compute the max cost based on the f2fs_sbi_info and victim policy. Since get_max_cost depends on on three parameters of victim_sel_policy => alloc_mode, gc_mode & ofs_unit, once this victim policy is initialised, these value will not change till the execution time of get_victim_by_default() & also f2fs_sbi_info structure parameters will not change. Hence making calls to get_max_cost() in while loop does not seems to be a good point. Instead we can call it once in begining and store the results in local variable, which later can serve our purpose for comparing the cost with max cost inside the while loop. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-05jfs: Update jfs_errorJoe Perches
Use a more current logging style. Add __printf format and argument verification. Remove embedded function names from formats. Add %pf, __builtin_return_address(0) to jfs_error. Add newlines to formats for kernel style consistency. (One format already had an erroneous newline) Coalesce formats and align arguments. Object size reduced ~1KiB. $ size fs/jfs/built-in.o* text data bss dec hex filename 201891 35488 63936 301315 49903 fs/jfs/built-in.o.new 202821 35488 64192 302501 49da5 fs/jfs/built-in.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-06-05jfs: fix sparse warning in fs/jfs/xattr.cDave Kleikamp
CHECK fs/jfs/xattr.c fs/jfs/xattr.c:1092:5: warning: symbol 'jfs_initxattrs' was not declared. Should it be static? Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-06-05xfs: increase number of ACL entries for V5 superblocksDave Chinner
The limit of 25 ACL entries is arbitrary, but baked into the on-disk format. For version 5 superblocks, increase it to the maximum nuber of ACLs that can fit into a single xattr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05xfs: disable noattr2/attr2 mount options for CRC enabled filesystemsDave Chinner
attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05xfs: inode unlinked list needs to recalculate the inode CRCDave Chinner
The inode unlinked list manipulations operate directly on the inode buffer, and so bypass the inode CRC calculation mechanisms. Hence an inode on the unlinked list has an invalid CRC. Fix this by recalculating the CRC whenever we modify an unlinked list pointer in an inode, ncluding during log recovery. This is trivial to do and results in unlinked list operations always leaving a consistent inode in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05xfs: fix log recovery transaction item reorderingDave Chinner
There are several constraints that inode allocation and unlink logging impose on log recovery. These all stem from the fact that inode alloc/unlink are logged in buffers, but all other inode changes are logged in inode items. Hence there are ordering constraints that recovery must follow to ensure the correct result occurs. As it turns out, this ordering has been working mostly by chance than good management. The existing code moves all buffers except cancelled buffers to the head of the list, and everything else to the tail of the list. The problem with this is that is interleaves inode items with the buffer cancellation items, and hence whether the inode item in an cancelled buffer gets replayed is essentially left to chance. Further, this ordering causes problems for log recovery when inode CRCs are enabled. It typically replays the inode unlink buffer long before it replays the inode core changes, and so the CRC recorded in an unlink buffer is going to be invalid and hence any attempt to validate the inode in the buffer is going to fail. Hence we really need to enforce the ordering that the inode alloc/unlink code has expected log recovery to have since inode chunk de-allocation was introduced back in 2003... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05GFS2: Remove no-op wrapper functionSteven Whitehouse
This wrapper function is no longer required, so get rid of it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05GFS2: Cocci spatch "ptr_ret.spatch"Thomas Meyer
Use PTR_RET in place of open coding this function. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05GFS2: Eliminate gfs2_rg_lopsBob Peterson
With recent changes to the transactions, it appears that we are no longer using the "log ops" for resource groups. Since the log commit code processes the array of log ops, eliminating this should be marginally better for performance. Therefore this patch eliminates it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05GFS2: Sort buffer lists by inplace block numberBenjamin Marzinski
This patch simply sort the data and metadata buffer lists by their inplace block number. This makes gfs2_log_flush issue the inplace IO in sequential order, which will hopefully speed up writing the IO out to disk. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-04eCryptfs: Check return of filemap_write_and_wait during fsyncTyler Hicks
Error out of ecryptfs_fsync() if filemap_write_and_wait() fails. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: Paul Taysom <taysom@chromium.org> Cc: Olof Johansson <olofj@chromium.org> Cc: stable@vger.kernel.org # v3.6+
2013-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixesLinus Torvalds
Pull gfs2 fixes from Steven Whitehouse: "There are four patches this time. The first fixes a problem where the wrong descriptor type was being written into the log for journaled data blocks. The second fixes a race relating to the deallocation of allocator data. The third provides a fallback if kmalloc is unable to satisfy a request to allocate a directory hash table. The fourth fixes the iopen glock caching so that inodes are deleted in a more timely manner after rmdir/unlink" * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes: GFS2: Don't cache iopen glocks GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables GFS2: Increase i_writecount during gfs2_setattr_size GFS2: Set log descriptor type for jdata blocks
2013-06-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "One patch fixes an Oops introduced in 3.9 with the readdirplus feature. The rest are fixes for async-dio in 3.10" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix alignment in short read optimization for async_dio fuse: return -EIOCBQUEUED from fuse_direct_IO() for all async requests fuse: fix readdirplus Oops in fuse_dentry_revalidate fuse: update inode size and invalidate attributes on fallocate fuse: truncate pagecache range on hole punch fuse: allocate for_background dio requests based on io->async state
2013-06-04xfs: fix remote attribute invalidation for a leafDave Chinner
When invalidating an attribute leaf block block, there might be remote attributes that it points to. With the recent rework of the remote attribute format, we have to make sure we calculate the length of the attribute correctly. We aren't doing that in xfs_attr3_leaf_inactive(), so fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinuguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-04xfs: rework dquot CRCsDave Chinner
Calculating dquot CRCs when the backing buffer is written back just doesn't work reliably. There are several places which manipulate dquots directly in the buffers, and they don't calculate CRCs appropriately, nor do they always set the buffer up to calculate CRCs appropriately. Firstly, if we log a dquot buffer (e.g. during allocation) it gets logged without valid CRC, and so on recovery we end up with a dquot that is not valid. Secondly, if we recover/repair a dquot, we don't have a verifier attached to the buffer and hence CRCs are not calculated on the way down to disk. Thirdly, calculating the CRC after we've changed the contents means that if we re-read the dquot from the buffer, we cannot verify the contents of the dquot are valid, as the CRC is invalid. So, to avoid all the dquot CRC errors that are being detected by the read verifier, change to using the same model as for inodes. That is, dquot CRCs are calculated and written to the backing buffer at the time the dquot is flushed to the backing buffer. If we modify the dquot directly in the backing buffer, calculate the CRC immediately after the modification is complete. Hence the dquot in the on-disk buffer should always have a valid CRC. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-04ext4: remove ext4_ioend_wait()Jan Kara
Now that we clear PageWriteback after extent conversion, there's no need to wait for io_end processing in ext4_evict_inode(). Running AIO/DIO keeps file reference until aio_complete() is called so ext4_evict_inode() cannot be called. For io_end structures resulting from buffered IO waiting is happening because we wait for PageWriteback in truncate_inode_pages(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: don't wait for extent conversion in ext4_punch_hole()Jan Kara
We don't have to wait for extent conversion in ext4_punch_hole() as buffered IO for the punched range has been flushed and waited upon (thus all extent conversions for that range have completed). Also we wait for all DIO to finish using inode_dio_wait() so there cannot be any extent conversions pending due to direct IO. Also remove ext4_flush_unwritten_io() since it's unused now. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()Jan Kara
We don't have to wait for unwritten extent conversion in ext4_ind_direct_IO() as all writes that happened before DIO are flushed by the generic code and extent conversion has happened before we cleared PageWriteback bit. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove i_mutex from ext4_file_sync()Jan Kara
After removal of ext4_flush_unwritten_io() call, ext4_file_sync() doesn't need i_mutex anymore. Forcing of transaction commits doesn't need i_mutex as there's nothing inode specific in that code apart from grabbing transaction ids from the inode. So remove the lock. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: use generic_file_fsync() in ext4_file_fsync() in nojournal modeJan Kara
Just use the generic function instead of duplicating it. We only need to reshuffle the read-only check a bit (which is there to prevent writing to a filesystem which has been remounted read-only after error I assume). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove wait for unwritten extent conversion from ext4_truncate()Jan Kara
Since PageWriteback bit is now cleared after extents are converted from unwritten to written ones, we have full exclusion of writeback path from truncate (truncate_inode_pages() waits for PageWriteback bits to get cleared on all invalidated pages). Exclusion from DIO path is achieved by inode_dio_wait() call in ext4_setattr(). So there's no need to wait for extent convertion in ext4_truncate() anymore. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: protect extent conversion after DIO with i_dio_countJan Kara
Make sure extent conversion after DIO happens while i_dio_count is still elevated so that inode_dio_wait() waits until extent conversion is done. This removes the need for explicit waiting for extent conversion in some cases. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: defer clearing of PageWriteback after extent conversionJan Kara
Currently PageWriteback bit gets cleared from put_io_page() called from ext4_end_bio(). This is somewhat inconvenient as extent tree is not fully updated at that time (unwritten extents are not marked as written) so we cannot read the data back yet. This design was dictated by lock ordering as we cannot start a transaction while PageWriteback bit is set (we could easily deadlock with ext4_da_writepages()). But now that we use transaction reservation for extent conversion, locking issues are solved and we can move PageWriteback bit clearing after extent conversion is done. As a result we can remove wait for unwritten extent conversion from ext4_sync_file() because it already implicitely happens through wait_on_page_writeback(). We implement deferring of PageWriteback clearing by queueing completed bios to appropriate io_end and processing all the pages when io_end is going to be freed instead of at the moment ext4_io_end() is called. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: split extent conversion lists to reserved & unreserved partsJan Kara
Now that we have extent conversions with reserved transaction, we have to prevent extent conversions without reserved transaction (from DIO code) to block these (as that would effectively void any transaction reservation we did). So split lists, work items, and work queues to reserved and unreserved parts. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: use transaction reservation for extent conversion in ext4_end_ioJan Kara
Later we would like to clear PageWriteback bit only after extent conversion from unwritten to written extents is performed. However it is not possible to start a transaction after PageWriteback is set because that violates lock ordering (and is easy to deadlock). So we have to reserve a transaction before locking pages and sending them for IO and later we use the transaction for extent conversion from ext4_end_io(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: remove buffer_uninit handlingJan Kara
There isn't any need for setting BH_Uninit on buffers anymore. It was only used to signal we need to mark io_end as needing extent conversion in add_bh_to_extent() but now we can mark the io_end directly when mapping extent. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: restructure writeback pathJan Kara
There are two issues with current writeback path in ext4. For one we don't necessarily map complete pages when blocksize < pagesize and thus needn't do any writeback in one iteration. We always map some blocks though so we will eventually finish mapping the page. Just if writeback races with other operations on the file, forward progress is not really guaranteed. The second problem is that current code structure makes it hard to associate all the bios to some range of pages with one io_end structure so that unwritten extents can be converted after all the bios are finished. This will be especially difficult later when io_end will be associated with reserved transaction handle. We restructure the writeback path to a relatively simple loop which first prepares extent of pages, then maps one or more extents so that no page is partially mapped, and once page is fully mapped it is submitted for IO. We keep all the mapping and IO submission information in mpage_da_data structure to somewhat reduce stack usage. Resulting code is somewhat shorter than the old one and hopefully also easier to read. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: better estimate credits needed for ext4_da_writepages()Jan Kara
We limit the number of blocks written in a single loop of ext4_da_writepages() to 64 when inode uses indirect blocks. That is unnecessary as credit estimates for mapping logically continguous run of blocks is rather low even for inode with indirect blocks. So just lift this limitation and properly calculate the number of necessary credits. This better credit estimate will also later allow us to always write at least a single page in one iteration. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: improve writepage credit estimate for files with indirect blocksJan Kara
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether blocks mapped are logically contiguous. That is wrong since the argument informs whether the blocks are physically contiguous. As the blocks mapped are always logically contiguous and that's all ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: deprecate max_writeback_mb_bump sysfs attributeJan Kara
This attribute is now unused so deprecate it. We still show the old default value to keep some compatibility but we don't allow writing to that attribute anymore. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: stop messing with nr_to_write in ext4_da_writepages()Jan Kara
Writeback code got better in how it submits IO and now the number of pages requested to be written is usually higher than original 1024. The number is now dynamically computed based on observed throughput and is set to be about 0.5 s worth of writeback. E.g. on ordinary SATA drive this ends up somewhere around 10000 as my testing shows. So remove the unnecessary smarts from ext4_da_writepages(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>