summaryrefslogtreecommitdiffstats
path: root/fs/ext4/page-io.c
AgeCommit message (Collapse)Author
2011-02-12ext4: serialize unaligned asynchronous DIOEric Sandeen
ext4 has a data corruption case when doing non-block-aligned asynchronous direct IO into a sparse file, as demonstrated by xfstest 240. The root cause is that while ext4 preallocates space in the hole, mappings of that space still look "new" and dio_zero_block() will zero out the unwritten portions. When more than one AIO thread is going, they both find this "new" block and race to zero out their portion; this is uncoordinated and causes data corruption. Dave Chinner fixed this for xfs by simply serializing all unaligned asynchronous direct IO. I've done the same here. The difference is that we only wait on conversions, not all IO. This is a very big hammer, and I'm not very pleased with stuffing this into ext4_file_write(). But since ext4 is DIO_LOCKING, we need to serialize it at this high level. I tried to move this into ext4_ext_direct_IO, but by then we have the i_mutex already, and we will wait on the work queue to do conversions - which must also take the i_mutex. So that won't work. This was originally exposed by qemu-kvm installing to a raw disk image with a normal sector-63 alignment. I've tested a backport of this patch with qemu, and it does avoid the corruption. It is also quite a lot slower (14 min for package installs, vs. 8 min for well-aligned) but I'll take slow correctness over fast corruption any day. Mingming suggested that we can track outstanding conversions, and wait on those so that non-sparse files won't be affected, and I've implemented that here; unaligned AIO to nonsparse files won't take a perf hit. [tytso@mit.edu: Keep the mutex as a hashed array instead of bloating the ext4 inode] [tytso@mit.edu: Fix up namespace issues so that global variables are protected with an "ext4_" prefix.] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-07ext4: Fix data corruption with multi-block writepages supportCurt Wohlgemuth
This fixes a corruption problem with the multi-block writepages submittal change for ext4, from commit bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc ("ext4: use bio layer instead of buffer layer in mpage_da_submit_io"). (Note that this corruption is not present in 2.6.37 on ext4, because the corruption was detected after the feature was merged in 2.6.37-rc1, and so it was turned off by adding a non-default mount option, mblk_io_submit. With this commit, which hopefully fixes the last of the bugs with this feature, we'll be able to turn on this performance feature by default in 2.6.38, and remove the mblk_io_submit option.) The ext4 code path to bundle multiple pages for writeback in ext4_bio_write_page() had a bug: we should be clearing buffer head dirty flags *before* we submit the bio, not in the completion routine. The patch below was tested on 2.6.37 under KVM with the postgresql script which was submitted by Jon Nelson as documented in commit 1449032be1. Without the patch, I'd hit the corruption problem about 50-70% of the time. With the patch, I executed the script > 100 times with no corruption seen. I also fixed a bug to make sure ext4_end_bio() doesn't dereference the bio after the bio_put() call. Reported-by: Jon Nelson <jnelson@jamponi.net> Reported-by: Matthias Bayer <jackdachef@gmail.com> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2011-01-10ext4: test the correct variable in ext4_init_pageio()Dan Carpenter
This is a copy and paste error. The intent was to check "io_page_cachep". We tested "io_page_cachep" earlier. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19ext4: use kmem_cache_zalloc() in ext4_init_io_end()Jesper Juhl
Use advantage of kmem_cache_zalloc() to remove a memset() call in ext4_init_io_end() and save a few bytes. Before: [jj@dragon linux-2.6]$ size fs/ext4/page-io.o text data bss dec hex filename 3016 0 624 3640 e38 fs/ext4/page-io.o After: [jj@dragon linux-2.6]$ size fs/ext4/page-io.o text data bss dec hex filename 3000 0 624 3624 e28 fs/ext4/page-io.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17ext4: fix setting random pages PageUptodateMarkus Trippelsdorf
ext4_end_bio calls put_page and kmem_cache_free before calling SetPageUpdate(). This can result in setting the PageUptodate bit on random pages and causes the following BUG: BUG: Bad page state in process rm pfn:52e54 page:ffffea0001222260 count:0 mapcount:0 mapping: (null) index:0x0 arch kernel: page flags: 0x4000000000000008(uptodate) Fix the problem by moving put_io_page() after the SetPageUpdate() call. Thanks to Hugh Dickins for analyzing this problem. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08ext4: fix potential race when freeing ext4_io_page structuresTheodore Ts'o
Use an atomic_t and make sure we don't free the structure while we might still be submitting I/O for that page. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08ext4: handle writeback of inodes which are being freedTheodore Ts'o
The following BUG can occur when an inode which is getting freed when it still has dirty pages outstanding, and it gets deleted (in this because it was the target of a rename). In ordered mode, we need to make sure the data pages are written just in case we crash before the rename (or unlink) is committed. If the inode is being freed then when we try to igrab the inode, we end up tripping the BUG_ON at fs/ext4/page-io.c:146. To solve this problem, we need to keep track of the number of io callbacks which are pending, and avoid destroying the inode until they have all been completed. That way we don't have to bump the inode count to keep the inode from being destroyed; an approach which doesn't work because the count could have already been dropped down to zero before the inode writeback has started (at which point we're not allowed to bump the count back up to 1, since it's already started getting freed). Thanks to Dave Chinner for suggesting this approach, which is also used by XFS. kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146! Call Trace: [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307 [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2 [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5 [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d [<ffffffff81087910>] do_writepages+0x1c/0x25 [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10 [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2 [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c [<ffffffff810c14a3>] evict+0x22/0x92 [<ffffffff810c1a3d>] iput+0x212/0x249 [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9 [<ffffffff810bdf6b>] d_kill+0x3d/0x5d [<ffffffff810be613>] dput+0x13a/0x147 [<ffffffff810b990d>] sys_renameat+0x1b5/0x258 [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38 [<ffffffff810b99c6>] sys_rename+0x16/0x18 [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Nick Bowler <nbowler@elliptictech.com>
2010-10-27ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()Theodore Ts'o
This is a cleanup to avoid namespace leaks out of fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27ext4: use bio layer instead of buffer layer in mpage_da_submit_ioTheodore Ts'o
Call the block I/O layer directly instad of going through the buffer layer. This should give us much better performance and scalability, as well as lowering our CPU utilization when doing buffered writeback. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>