summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/file.c
AgeCommit message (Collapse)Author
2014-06-09Btrfs: fix transaction leak during fsync callFilipe Manana
If btrfs_log_dentry_safe() returns an error, we set ret to 1 and fall through with the goal of committing the transaction. However, in the case where the inode doesn't need a full sync, we would call btrfs_wait_ordered_range() against the target range for our inode, and if it returned an error, we would return without commiting or ending the transaction. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09btrfs: Avoid trucating page or punching hole in a already existed hole.Qu Wenruo
btrfs_punch_hole() will truncate unaligned pages or punch hole on a already existed hole. This will cause unneeded zero page or holes splitting the original huge hole. This patch will skip already existed holes before any page truncating or hole punching. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09btrfs: Drop EXTENT_UPTODATE check in hole punching and direct lockingAlex Gartrell
In these instances, we are trying to determine if a page has been accessed since we began the operation for the sake of retry. This is easily accomplished by doing a gang lookup in the page mapping radix tree, and it saves us the dependency on the flag (so that we might eventually delete it). btrfs_page_exists_in_range borrows heavily from find_get_page, replacing the radix tree look up with a gang lookup of 1, so that we can find the next highest page >= index and see if it falls into our lock range. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Alex Gartrell <agartrell@fb.com>
2014-06-09Btrfs: rework qgroup accountingJosef Bacik
Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: use bitfield instead of integer data type for the some variants in ↵Miao Xie
btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: fix leaf corruption caused by ENOSPC while hole punchingFilipe Manana
While running a stress test with multiple threads writing to the same btrfs file system, I ended up with a situation where a leaf was corrupted in that it had 2 file extent item keys that had the same exact key. I was able to detect this quickly thanks to the following patch which triggers an assertion as soon as a leaf is marked dirty if there are duplicated keys or out of order keys: Btrfs: check if items are ordered when a leaf is marked dirty (https://patchwork.kernel.org/patch/3955431/) Basically while running the test, I got the following in dmesg: [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]() (...) [28877.415917] Call Trace: [28877.415922] [<ffffffff816f1189>] dump_stack+0x4e/0x68 [28877.415926] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [28877.415929] [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20 [28877.415944] [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs] [28877.415949] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [28877.415962] [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs] [28877.415972] [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs] [28877.415984] [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs] (...) [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24 [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778 (...) [29854.132637] item 23 key (3486 108 667648) itemoff 2694 itemsize 53 [29854.132638] extent data disk bytenr 14574411776 nr 286720 [29854.132639] extent data offset 0 nr 286720 ram 286720 [29854.132640] item 24 key (3486 108 954368) itemoff 2641 itemsize 53 [29854.132641] extent data disk bytenr 0 nr 0 [29854.132643] extent data offset 0 nr 0 ram 0 [29854.132644] item 25 key (3486 108 954368) itemoff 2588 itemsize 53 [29854.132645] extent data disk bytenr 8699670528 nr 77824 [29854.132646] extent data offset 0 nr 77824 ram 77824 [29854.132647] item 26 key (3486 108 1146880) itemoff 2535 itemsize 53 [29854.132648] extent data disk bytenr 8699670528 nr 77824 [29854.132649] extent data offset 0 nr 77824 ram 77824 (...) [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901! (...) [29854.132771] Call Trace: [29854.132779] [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs] [29854.132791] [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs] [29854.132794] [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0 [29854.132797] [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10 [29854.132800] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [29854.132810] [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs] [29854.132820] [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs] [29854.132830] [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs] (...) So this is caused by getting an -ENOSPC error while punching a file hole, more specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one or more file extent items due to lack of enough free space. When this happens, in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction block reservation object to root->fs_info->trans_block_rsv, end our transaction and start a new transaction basically - and, we keep increasing our current offset (cur_offset) as long as it's smaller than the end of the target range (lockend) - this makes use leave the loop with cur_offset == drop_end which in turn makes us call fill_holes() for inserting a file extent item that represents a 0 bytes range hole (and this insertion succeeds, as in the meanwhile more space became available). This 0 bytes file hole extent item is a problem because any subsequent caller of __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a start file offset that is equal to the offset of the hole, will not remove this extent item due to the following conditional in the while loop of __btrfs_drop_extents: if (extent_end <= search_start) { path->slots[0]++; goto next_slot; } This later makes the call to setup_items_for_insert() (at the very end of __btrfs_drop_extents), insert a new file extent item with the same offset as the 0 bytes file hole extent item that follows it. Needless is to say that this causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook), where we perform leaf sanity checks or in subsequent operations that manipulate file extent items, as in the fallocate call as shown by the dmesg trace above. Without my other patch to perform the leaf sanity checks once a leaf is marked as dirty (if the integrity checker is enabled), it would have been much harder to debug this issue. This change might fix a few similar issues reported by users in the mailing list regarding assertion failures in btrfs_set_item_key_safe calls performed by __btrfs_drop_extents, such as the following report: http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938 Asking fill_holes() to create a 0 bytes wide file hole item also produced the first warning in the trace above, as we passed a range to btrfs_drop_extent_cache that has an end smaller (by -1) than its start. On 3.14 kernels this issue manifests itself through leaf corruption, as we get duplicated file extent item keys in a leaf when calling setup_items_for_insert(), but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(), instead we have callers of __btrfs_drop_extents(), namely the functions inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling btrfs_insert_empty_item() to insert the new file extent item, which would fail with error -EEXIST, instead of inserting a duplicated key - which is still a serious issue as it would make all similar file extent item replace operations keep failing if they target the same file range. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: read inode size after acquiring the mutex when punching a holeFilipe Manana
In a previous change, commit 12870f1c9b2de7d475d22e73fd7db1b418599725, I accidentally moved the roundup of inode->i_size to outside of the critical section delimited by the inode mutex, which is not atomic and not correct since the size can be changed by other task before we acquire the mutex. Therefore fix it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: limit the path size in send to PATH_MAX Btrfs: correctly set profile flags on seqlock retry Btrfs: use correct key when repeating search for extent item Btrfs: fix inode caching vs tree log Btrfs: fix possible memory leaks in open_ctree() Btrfs: avoid triggering bug_on() when we fail to start inode caching task Btrfs: move btrfs_{set,clear}_and_info() to ctree.h btrfs: replace error code from btrfs_drop_extents btrfs: Change the hole range to a more accurate value. btrfs: fix use-after-free in mount_subvol()
2014-04-24btrfs: replace error code from btrfs_drop_extentsDavid Sterba
There's a case which clone does not handle and used to BUG_ON instead, (testcase xfstests/btrfs/035), now returns EINVAL. This error code is confusing to the ioctl caller, as it normally signifies errorneous arguments. Change it to ENOPNOTSUPP which allows a fall back to copy instead of clone. This does not affect the common reflink operation. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24btrfs: Change the hole range to a more accurate value.Qu Wenruo
Commit 3ac0d7b96a268a98bd474cab8bce3a9f125aaccf fixed the btrfs expanding write problem but the hole punched is sometimes too large for some iovec, which has unmapped data ranges. This patch will change to hole range to a more accurate value using the counts checked by the write check routines. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull second set of btrfs updates from Chris Mason: "The most important changes here are from Josef, fixing a btrfs regression in 3.14 that can cause corruptions in the extent allocation tree when snapshots are in use. Josef also fixed some deadlocks in send/recv and other assorted races when balance is running" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits) Btrfs: fix compile warnings on on avr32 platform btrfs: allow mounting btrfs subvolumes with different ro/rw options btrfs: export global block reserve size as space_info btrfs: fix crash in remount(thread_pool=) case Btrfs: abort the transaction when we don't find our extent ref Btrfs: fix EINVAL checks in btrfs_clone Btrfs: fix unlock in __start_delalloc_inodes() Btrfs: scrub raid56 stripes in the right way Btrfs: don't compress for a small write Btrfs: more efficient io tree navigation on wait_extent_bit Btrfs: send, build path string only once in send_hole btrfs: filter invalid arg for btrfs resize Btrfs: send, fix data corruption due to incorrect hole detection Btrfs: kmalloc() doesn't return an ERR_PTR Btrfs: fix snapshot vs nocow writting btrfs: Change the expanding write sequence to fix snapshot related bug. btrfs: make device scan less noisy btrfs: fix lockdep warning with reclaim lock inversion Btrfs: hold the commit_root_sem when getting the commit root during send Btrfs: remove transaction from send ...
2014-04-07mm: implement ->map_pages for page cacheKirill A. Shutemov
filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07btrfs: Change the expanding write sequence to fix snapshot related bug.Qu Wenruo
When testing fsstress with snapshot making background, some snapshot following problem. Snapshot 270: inode 323: size 0 Snapshot 271: inode 323: size 349145 |-------Hole---|---------Empty gap-------|-------Hole-----| 0 122880 172032 349145 Snapshot 272: inode 323: size 349145 |-------Hole---|------------Data---------|-------Hole-----| 0 122880 172032 349145 The fsstress operation on inode 323 is the following: write: offset 126832 len 43124 truncate: size 349145 Since the write with offset is consist of 2 operations: 1. punch hole 2. write data Hole punching is faster than data write, so hole punching in write and truncate is done first and then buffered write, so the snapshot 271 got empty gap, which will not pass btrfsck. To fix the bug, this patch will change the write sequence which will first punch a hole covering the write end if a hole is needed. Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs changes from Chris Mason: "This is a pretty long stream of bug fixes and performance fixes. Qu Wenruo has replaced the btrfs async threads with regular kernel workqueues. We'll keep an eye out for performance differences, but it's nice to be using more generic code for this. We still have some corruption fixes and other patches coming in for the merge window, but this batch is tested and ready to go" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits) Btrfs: fix a crash of clone with inline extents's split btrfs: fix uninit variable warning Btrfs: take into account total references when doing backref lookup Btrfs: part 2, fix incremental send's decision to delay a dir move/rename Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: remove unnecessary inode generation lookup in send Btrfs: fix race when updating existing ref head btrfs: Add trace for btrfs_workqueue alloc/destroy Btrfs: less fs tree lock contention when using autodefrag Btrfs: return EPERM when deleting a default subvolume Btrfs: add missing kfree in btrfs_destroy_workqueue Btrfs: cache extent states in defrag code path Btrfs: fix deadlock with nested trans handles Btrfs: fix possible empty list access when flushing the delalloc inodes Btrfs: split the global ordered extents mutex Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock Btrfs: reclaim delalloc metadata more aggressively Btrfs: remove unnecessary lock in may_commit_transaction() Btrfs: remove the unnecessary flush when preparing the pages Btrfs: just do dirty page flush for the inode with compression before direct IO ...
2014-04-03fs/direct-io.c: remove some left over checksDan Carpenter
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-01generic_file_direct_write(): get rid of ppos argumentAl Viro
always equal to &iocb->ki_pos. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01btrfs_file_aio_write(): get rid of pposAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01callers of iov_copy_from_user_atomic() don't need pagecache_disable()Al Viro
... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-21Btrfs: fix a crash of clone with inline extents's splitLiu Bo
xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split of inline extents in __btrfs_drop_extents(). For inline extents, we cannot duplicate another EXTENT_DATA item, because it breaks the rule of inline extents, that is, 'start offset' needs to be 0. We have set limitations for the source inode's compressed inline extents, because it needs to decompress and recompress. Now the destination inode's inline extents also need similar limitations. With this, xfstests btrfs/035 doesn't run into panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-10Btrfs: remove the unnecessary flush when preparing the pagesMiao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolumeMiao Xie
If the snapshot creation happened after the nocow write but before the dirty data flush, we would fail to flush the dirty data because of no space. So we must keep track of when those nocow write operations start and when they end, if there are nocow writers, the snapshot creators must wait. In order to implement this function, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix preallocate vs double nocow writeMiao Xie
We can not release the reserved metadata space for the first write if we find the write position is pre-allocated. Because the kernel might write the data on the disk before we do the second write but after the can-nocow check, if we release the space for the first write, we might fail to update the metadata because of no space. Fix this problem by end nocow write if there is dirty data in the range whose space is pre-allocated. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix wrong lock range and write size in check_can_nocow()Miao Xie
The write range may not be sector-aligned, for example: |--------|--------| <- write range, sector-unaligned, size: 2blocks |--------|--------|--------| <- correct lock range, size: 3blocks But according to the old code, we used the size of write range to calculate the lock range directly, not considered the offset, we would get a wrong lock range: |--------|--------| <- write range, sector-unaligned, size: 2blocks |--------|--------| <- wrong lock range, size: 2blocks And besides that, the old code also had the same problem when calculating the real write size. Correct them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: more efficient btrfs_drop_extent_cacheFilipe Manana
While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't insert useless holes when punching beyond the inode's sizeFilipe Manana
If we punch beyond the size of an inode, we'll correctly remove any prealloc extents, but we'll also insert file extent items representing holes (disk bytenr == 0) that start with a key offset that lies beyond the inode's size and are not contiguous with the last file extent item. Example: $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo btrfs-debug-tree output: item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1 item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 inode ref index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 90112 ram 122880 extent compression 0 item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 45056 ram 45056 extent compression 2 item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 860160 ram 860160 extent compression 0 The last extent item, which represents a hole, is useless as it lies beyond the inode's size. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix skipped error handle when log sync failedMiao Xie
It is possible that many tasks sync the log tree at the same time, but only one task can do the sync work, the others will wait for it. But those wait tasks didn't get the result of the log sync, and returned 0 when they ended the wait. It caused those tasks skipped the error handle, and the serious problem was they told the users the file sync succeeded but in fact they failed. This patch fixes this problem by introducing a log context structure, we insert it into the a global list. When the sync fails, we will set the error number of every log context in the list, then the waiting tasks get the error number of the log context and handle the error if need. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: faster/more efficient insertion of file extent itemsFilipe David Borba Manana
This is an extension to my previous commit titled: "Btrfs: faster file extent item replace operations" (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9) Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations. This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-29Btrfs: don't use ram_bytes for uncompressed inline itemsChris Mason
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
2014-01-28Btrfs: fix the race between write back and nocow buffered writeMiao Xie
When we ran the 274th case of xfstests with nodatacow mount option, We met the following warning message: WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0 It is caused by the race between the write back and nocow buffered write: Task1 Task2 __btrfs_buffered_write() skip data reservation reserve the metadata space copy the data dirty the pages unlock the pages write back the pages release the data space becasue there is no noreserve flag set the noreserve flag This patch fixes this problem by unlocking the pages after the noreserve flag is set. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: make fsync latency less suckyJosef Bacik
Looking into some performance related issues with large amounts of metadata revealed that we can have some pretty huge swings in fsync() performance. If we have a lot of delayed refs backed up (as you will tend to do with lots of metadata) fsync() will wander off and try to run some of those delayed refs which can result in reading from disk and such. Since the actual act of fsync() doesn't create any delayed refs there is no need to make it throttle on delayed ref stuff, that will be handled by other people. With this patch we get much smoother fsync performance with large amounts of metadata. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: faster file extent item replace operationsFilipe David Borba Manana
When writing to a file we drop existing file extent items that cover the write range and then add a new file extent item that represents that write range. Before this change we were doing a tree lookup to remove the file extent items, and then after we did another tree lookup to insert the new file extent item. Most of the time all the file extent items we need to drop are located within a single leaf - this is the leaf where our new file extent item ends up at. Therefore, in this common case just combine these 2 operations into a single one. By avoiding the second btree navigation for insertion of the new file extent item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf COW operations, CPU time on btree node/leaf key binary searches, etc. Besides for file writes, this is an operation that happens for file fsync's as well. However log btrees are much less likely to big as big as regular fs btrees, therefore the impact of this change is smaller. The following benchmark was performed against an SSD drive and a HDD drive, both for random and sequential writes: sysbench --test=fileio --file-num=4096 --file-total-size=8G \ --file-test-mode=[rndwr|seqwr] --num-threads=512 \ --file-block-size=8192 \ --max-requests=1000000 \ --file-fsync-freq=0 --file-io-mode=sync [prepare|run] All results below are averages of 10 runs of the respective test. ** SSD sequential writes Before this change: 225.88 Mb/sec After this change: 277.26 Mb/sec ** SSD random writes Before this change: 49.91 Mb/sec After this change: 56.39 Mb/sec ** HDD sequential writes Before this change: 68.53 Mb/sec After this change: 69.87 Mb/sec ** HDD random writes Before this change: 13.04 Mb/sec After this change: 14.39 Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix use of uninitialized err variableFilipe David Borba Manana
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’: fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix ordered extent check in btrfs_punch_holeFilipe David Borba Manana
If the ordered extent's last byte was 1 less than our region's start byte, we would unnecessarily wait for the completion of that ordered extent, because it doesn't intersect our target range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix the reserved space leak caused by the race between nonlock dio ↵Miao Xie
and buffered io When we ran sysbench on the fs with compression, the following WARN_ONs were triggered: fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents); fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents); fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes); Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync prepare # cd - # umount <mnt> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync run # cd - # umount <mnt> The reason of this problem is: Task0 Task1 btrfs_direct_IO unlock(&inode->i_mutex) lock(&inode->i_mutex) reserve_space() prepare_pages() lock_extent() clear_extent() unlock_extent() lock_extent() test_extent(uptodate) return false copy_data() set_delalloc_extent() extent need compress go back to buffered write clear_extent(DELALLOC | DIRTY) unlock_extent() Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which was set by task1, it made the dirty pages in that extents couldn't be flushed into the disk, so the reserved space for that extent was not released at the end. This patch fixes the above bug by unlocking the extent after the delalloc. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: cleanup unnecessary parameter and variant of prepare_pages()Miao Xie
- the caller has gotten the inode object, needn't pass the file object. And if so, we needn't define a inode pointer variant. - the position should be aligned by the page size not sector size, so we also needn't pass the root object into prepare_pages(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: incompatible format change to remove hole extentsJosef Bacik
Btrfs has always had these filler extent data items for holes in inodes. This has made somethings very easy, like logging hole punches and sending hole punches. However for large holey files these extent data items are pure overhead. So add an incompatible feature to no longer add hole extents to reduce the amount of metadata used by these sort of files. This has a few changes for logging and send obviously since they will need to detect holes and log/send the holes if there are any. I've tested this thoroughly with xfstests and it doesn't cause any issues with and without the incompat format set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-11-11btrfs: Fix checkpatch.pl warning of spacing issuesDulshani Gunawardhana
Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: return an error from btrfs_wait_ordered_rangeJosef Bacik
I noticed that if the free space cache has an error writing out it's data it won't actually error out, it will just carry on. This is because it doesn't check the return value of btrfs_wait_ordered_range, which didn't actually return anything. So fix this in order to keep us from making free space cache look valid when it really isnt. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11btrfs: remove fs/btrfs/compat.hZach Brown
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink() and inc_nlink(). This doesn't belong in mainline. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11Btrfs: fix up seek_hole/seek_data handlingJosef Bacik
Whoever wrote this was braindead. Also it doesn't work right if you have VACANCY's since we assumed you would only have that at the end of the file, which won't be the case in the near future. I tested this with generic/285 and generic/286 as well as the btrfs tests that use fssum since it uses seek_hole/seek_data to verify things are ok. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are mostly bug fixes and a two small performance fixes. The most important of the bunch are Josef's fix for a snapshotting regression and Mark's update to fix compile problems on arm" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: create the uuid tree on remount rw btrfs: change extent-same to copy entire argument struct Btrfs: dir_inode_operations should use btrfs_update_time also btrfs: Add btrfs: prefix to kernel log output btrfs: refuse to remount read-write after abort Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 Btrfs: don't leak transaction in btrfs_sync_file() Btrfs: add the missing mutex unlock in write_all_supers() Btrfs: iput inode on allocation failure Btrfs: remove space_info->reservation_progress Btrfs: kill delay_iput arg to the wait_ordered functions Btrfs: fix worst case calculator for space usage Revert "Btrfs: rework the overcommit logic to be based on the total size" Btrfs: improve replacing nocow extents Btrfs: drop dir i_size when adding new names on replay Btrfs: replay dir_index items before other items Btrfs: check roots last log commit when checking if an inode has been logged Btrfs: actually log directory we are fsync()'ing Btrfs: actually limit the size of delalloc range Btrfs: allocate the free space by the existed max extent size when ENOSPC ...
2013-09-21Btrfs: don't leak transaction in btrfs_sync_file()Filipe David Borba Manana
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would return without ending/freeing the transaction. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is against 3.11-rc7, but was pulled and tested against your tree as of yesterday. We do have two small incrementals queued up, but I wanted to get this bunch out the door before I hop on an airplane. This is a fairly large batch of fixes, performance improvements, and cleanups from the usual Btrfs suspects. We've included Stefan Behren's work to index subvolume UUIDs, which is targeted at speeding up send/receive with many subvolumes or snapshots in place. It closes a long standing performance issue that was built in to the disk format. Mark Fasheh's offline dedup work is also here. In this case offline means the FS is mounted and active, but the dedup work is not done inline during file IO. This is a building block where utilities are able to ask the FS to dedup a series of extents. The kernel takes care of verifying the data involved really is the same. Today this involves reading both extents, but we'll continue to evolve the patches" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits) Btrfs: optimize key searches in btrfs_search_slot Btrfs: don't use an async starter for most of our workers Btrfs: only update disk_i_size as we remove extents Btrfs: fix deadlock in uuid scan kthread Btrfs: stop refusing the relocation of chunk 0 Btrfs: fix memory leak of uuid_root in free_fs_info btrfs: reuse kbasename helper btrfs: return btrfs error code for dev excl ops err Btrfs: allow partial ordered extent completion Btrfs: convert all bug_ons in free-space-cache.c Btrfs: add support for asserts Btrfs: adjust the fs_devices->missing count on unmount Btrf: cleanup: don't check for root_refs == 0 twice Btrfs: fix for patch "cleanup: don't check the same thing twice" Btrfs: get rid of one BUG() in write_all_supers() Btrfs: allocate prelim_ref with a slab allocater Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl Btrfs: fix race between removing a dev and writing sbs Btrfs: remove ourselves from the cluster list under lock ...
2013-09-04direct-io: Handle O_(D)SYNC AIOChristoph Hellwig
Call generic_write_sync() from the deferred I/O completion handler if O_DSYNC is set for a write request. Also make sure various callers don't call generic_write_sync if the direct I/O code returns -EIOCBQUEUED. Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-01Btrf: cleanup: don't check for root_refs == 0 twiceStefan Behrens
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs() is zero and returns ENOENT in this case. There is no need to do it again in three more places. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01Btrfs: avoid starting a transaction in the write pathJosef Bacik
I noticed while looking at a deadlock that we are always starting a transaction in cow_file_range(). This isn't really needed since we only need a transaction if we are doing an inline extent, or if the allocator needs to allocate a chunk. So push down all the transaction start stuff to be closer to where we actually need a transaction in all of these cases. This will hopefully reduce our write latency when we are committing often. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01Btrfs: don't bother autodefragging if our root is going awayJosef Bacik
We can end up with inodes on the auto defrag list that exist on roots that are going to be deleted. This is extra work we don't need to do, so just bail if our root has 0 root refs. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: allow splitting of hole em's when dropping extent cacheJosef Bacik
I noticed while running multi-threaded fsync tests that sometimes fsck would complain about an improper gap. This happens because we fail to add a hole extent to the file, which was happening when we'd split a hole EM because btrfs_drop_extent_cache was just discarding the whole em instead of splitting it. So this patch fixes this by allowing us to split a hole em properly, which means that added holes actually get logged properly and we no longer see this fsck error. Thankfully we're tolerant of these sort of problems so a user would not see any adverse effects of this bug, other than fsck complaining. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-07-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "These are the usual mixture of bugs, cleanups and performance fixes. Miao has some really nice tuning of our crc code as well as our transaction commits. Josef is peeling off more and more problems related to early enospc, and has a number of important bug fixes in here too" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits) Btrfs: wait ordered range before doing direct io Btrfs: only do the tree_mod_log_free_eb if this is our last ref Btrfs: hold the tree mod lock in __tree_mod_log_rewind Btrfs: make backref walking code handle skinny metadata Btrfs: fix crash regarding to ulist_add_merge Btrfs: fix several potential problems in copy_nocow_pages_for_inode Btrfs: cleanup the code of copy_nocow_pages_for_inode() Btrfs: fix oops when recovering the file data by scrub function Btrfs: make the chunk allocator completely tree lockless Btrfs: cleanup orphaned root orphan item Btrfs: fix wrong mirror number tuning Btrfs: cleanup redundant code in btrfs_submit_direct() Btrfs: remove btrfs_sector_sum structure Btrfs: check if we can nocow if we don't have data space Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc Btrfs: use a percpu to keep track of possibly pinned bytes Btrfs: check for actual acls rather than just xattrs when caching no acl Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate Btrfs: optimize reada_for_balance Btrfs: optimize read_block_for_search ...