summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2014-04-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs changes from Chris Mason: "This is a pretty long stream of bug fixes and performance fixes. Qu Wenruo has replaced the btrfs async threads with regular kernel workqueues. We'll keep an eye out for performance differences, but it's nice to be using more generic code for this. We still have some corruption fixes and other patches coming in for the merge window, but this batch is tested and ready to go" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits) Btrfs: fix a crash of clone with inline extents's split btrfs: fix uninit variable warning Btrfs: take into account total references when doing backref lookup Btrfs: part 2, fix incremental send's decision to delay a dir move/rename Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: remove unnecessary inode generation lookup in send Btrfs: fix race when updating existing ref head btrfs: Add trace for btrfs_workqueue alloc/destroy Btrfs: less fs tree lock contention when using autodefrag Btrfs: return EPERM when deleting a default subvolume Btrfs: add missing kfree in btrfs_destroy_workqueue Btrfs: cache extent states in defrag code path Btrfs: fix deadlock with nested trans handles Btrfs: fix possible empty list access when flushing the delalloc inodes Btrfs: split the global ordered extents mutex Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock Btrfs: reclaim delalloc metadata more aggressively Btrfs: remove unnecessary lock in may_commit_transaction() Btrfs: remove the unnecessary flush when preparing the pages Btrfs: just do dirty page flush for the inode with compression before direct IO ...
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03fs/direct-io.c: remove some left over checksDan Carpenter
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-21Btrfs: fix a crash of clone with inline extents's splitLiu Bo
xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split of inline extents in __btrfs_drop_extents(). For inline extents, we cannot duplicate another EXTENT_DATA item, because it breaks the rule of inline extents, that is, 'start offset' needs to be 0. We have set limitations for the source inode's compressed inline extents, because it needs to decompress and recompress. Now the destination inode's inline extents also need similar limitations. With this, xfstests btrfs/035 doesn't run into panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21btrfs: fix uninit variable warningChris Mason
fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this function Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21Btrfs: take into account total references when doing backref lookupJosef Bacik
I added an optimization for large files where we would stop searching for backrefs once we had looked at the number of references we currently had for this extent. This works great most of the time, but for snapshots that point to this extent and has changes in the original root this assumption falls on it face. So keep track of any delayed ref mods made and add in the actual ref count as reported by the extent item and use that to limit how far down an inode we'll search for extents. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: Hugo Mills <hugo@carfax.org.uk> Tested-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21Btrfs: part 2, fix incremental send's decision to delay a dir move/renameFilipe Manana
For an incremental send, fix the process of determining whether the directory inode we're currently processing needs to have its move/rename operation delayed. We were ignoring the fact that if the inode's new immediate ancestor has a higher inode number than ours but wasn't renamed/moved, we might still need to delay our move/rename, because some other ancestor directory higher in the hierarchy might have an inode number higher than ours *and* was renamed/moved too - in this case we have to wait for rename/move of that ancestor to happen before our current directory's rename/move operation. Simple steps to reproduce this issue: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/x1/x2 $ mkdir /mnt/a/Z $ mkdir -p /mnt/a/x1/x2/x3/x4/x5 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33 $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The incremental send caused the kernel code to enter an infinite loop when building the path string for directory Z after its references are processed. A more complex scenario: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b/c/d $ mkdir /mnt/a/b/c/d/e $ mkdir /mnt/a/b/c/d/f $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2 $ mkdir /mmt/a/b/c/g $ mv /mnt/a/b/c/d /mnt/a/b/D2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/o $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2 $ mv /mnt/a/b/D2 /mnt/a/b/dd $ mv /mnt/a/b/c /mnt/a/C2 $ mv /mnt/a/b/dd/f /mnt/a/o/FF $ mv /mnt/a/b /mnt/a/o/FF/E2/BB $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-21Btrfs: fix incremental send's decision to delay a dir move/renameFilipe Manana
It's possible to change the parent/child relationship between directories in such a way that if a child directory has a higher inode number than its parent, it doesn't necessarily means the child rename/move operation can be performed immediately. The parent migth have its own rename/move operation delayed, therefore in this case the child needs to have its rename/move operation delayed too, and be performed after its new parent's rename/move. Steps to reproduce the issue: $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ mv /mnt/C /mnt/A $ mv /mnt/B /mnt/A/C $ mkdir /mnt/A/C/D $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/A/C/D /mnt/A/D2 $ mv /mnt/A/C/B /mnt/A/D2/B2 $ mv /mnt/A/C /mnt/A/D2/B2/C2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The incremental send caused the kernel code to enter an infinite loop when building the path string for directory C after its references are processed. The necessary conditions here are that C has an inode number higher than both A and B, and B as an higher inode number higher than A, and D has the highest inode number, that is: inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D) The same issue could happen if after the first snapshot there's any number of intermediary parent directories between A2 and B2, and between B2 and C2. A test case for xfstests follows, covering this simple case and more advanced ones, with files and hard links created inside the directories. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: remove unnecessary inode generation lookup in sendFilipe Manana
No need to search in the send tree for the generation number of the inode, we already have it in the recorded_ref structure passed to us. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: fix race when updating existing ref headFilipe Manana
While we update an existing ref head's extent_op, we're not holding its spinlock, so while we're updating its extent_op contents (key, flags) we can have a task running __btrfs_run_delayed_refs() that holds the ref head's lock and sets its extent_op to NULL right after the task updating the ref head just checked its extent_op was not NULL. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20btrfs: Add trace for btrfs_workqueue alloc/destroyQu Wenruo
Since most of the btrfs_workqueue is printed as pointer address, for easier analysis, add trace for btrfs_workqueue alloc/destroy. So it is possible to determine the workqueue that a given work belongs to(by comparing the wq pointer address with alloc trace event). Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: less fs tree lock contention when using autodefragFilipe Manana
When finding new extents during an autodefrag, don't do so many fs tree lookups to find an extent with a size smaller then the target treshold. Instead, after each fs tree forward search immediately unlock upper levels and process the entire leaf while holding a read lock on the leaf, since our leaf processing is very fast. This reduces lock contention, allowing for higher concurrency when other tasks want to write/update items related to other inodes in the fs tree, as we're not holding read locks on upper tree levels while processing the leaf and we do less tree searches. Test: sysbench --test=fileio --file-num=512 --file-total-size=16G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \ --max-requests=10000000000 [prepare|run] (fileystem mounted with -o autodefrag, averages of 5 runs) Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: return EPERM when deleting a default subvolumeGuangyu Sun
The error message is confusing: # btrfs sub delete /mnt/mysub/ Delete subvolume '/mnt/mysub' ERROR: cannot delete '/mnt/mysub' - Directory not empty The error message does not make sense to me: It's not about deleting a directory but it's a subvolume, and it doesn't matter if the subvolume is empty or not. Maybe EPERM or is more appropriate in this case, combined with an explanatory kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because it is configured as default subvolume.") Reported-by: Koen De Wit <koen.de.wit@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: add missing kfree in btrfs_destroy_workqueueFilipe Manana
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: cache extent states in defrag code pathFilipe Manana
When locking file ranges in the inode's io_tree, cache the first extent state that belongs to the target range, so that when unlocking the range we don't need to search in the io_tree again, reducing cpu time and making and therefore holding the io_tree's lock for a shorter period. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20Btrfs: fix deadlock with nested trans handlesJosef Bacik
Zach found this deadlock that would happen like this btrfs_end_transaction <- reduce trans->use_count to 0 btrfs_run_delayed_refs btrfs_cow_block find_free_extent btrfs_start_transaction <- increase trans->use_count to 1 allocate chunk btrfs_end_transaction <- decrease trans->use_count to 0 btrfs_run_delayed_refs lock tree block we are cowing above ^^ We need to only decrease trans->use_count if it is above 1, otherwise leave it alone. This will make nested trans be the only ones who decrease their added ref, and will let us get rid of the trans->use_count++ hack if we have to commit the transaction. Thanks, cc: stable@vger.kernel.org Reported-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-03-10Btrfs: fix possible empty list access when flushing the delalloc inodesMiao Xie
We didn't have a lock to protect the access to the delalloc inodes list, that is we might access a empty delalloc inodes list if someone start flushing delalloc inodes because the delalloc inodes were moved into a other list temporarily. Fix it by wrapping the access with a lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: split the global ordered extents mutexMiao Xie
When we create a snapshot, we just need wait the ordered extents in the source fs/file root, but because we use the global mutex to protect this ordered extents list of the source fs/file root to avoid accessing a empty list, if someone got the mutex to access the ordered extents list of the other fs/file root, we had to wait. This patch splits the above global mutex, now every fs/file root has its own mutex to protect its own list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lockMiao Xie
We needn't flush all delalloc inodes when we doesn't get s_umount lock, or we would make the tasks wait for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: reclaim delalloc metadata more aggressivelyMiao Xie
generic/074 in xfstests failed sometimes because of the enospc error, the reason of this problem is that we just reclaimed the space we need from the reserved space for delalloc, and then tried to reserve the space, but if some task did no-flush reservation between the above reclamation and reservation, Task1 Task2 shrink_delalloc() reclaim 1 block (The space that can be reserved now is 1 block) do no-flush reservation reserve 1 block (The space that can be reserved now is 0 block) reserving 1 block failed the reservation of Task1 failed, but in fact, there was enough space to reserve if we could reclaim more space before. Fix this problem by the aggressive reclamation of the reserved delalloc metadata space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unnecessary lock in may_commit_transaction()Miao Xie
The reason is: - The per-cpu counter has its own lock to protect itself. - Here we needn't get a exact value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove the unnecessary flush when preparing the pagesMiao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: just do dirty page flush for the inode with compression before direct IOMiao Xie
As the comment in the btrfs_direct_IO says, only the compressed pages need be flush again to make sure they are on the disk, but the common pages needn't, so we add a if statement to check if the inode has compressed pages or not, if no, skip the flush. And in order to prevent the write ranges from intersecting, we need wait for the running ordered extents. But the current code waits for them twice, one is done before the direct IO starts (in btrfs_wait_ordered_range()), the other is before we get the blocks, it is unnecessary. because we can do the direct IO without holding i_mutex, it means that the intersected ordered extents may happen during the direct IO, the first wait can not avoid this problem. So we use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove the first wait. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: wake up the tasks that wait for the io earlierMiao Xie
The tasks that wait for the IO_DONE flag just care about the io of the dirty pages, so it is better to wake up them immediately after all the pages are written, not the whole process of the io completes. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix early enospc due to the race of the two ordered extent waitMiao Xie
btrfs_wait_ordered_roots() moves all the list entries to a new list, and then deals with them one by one. But if the other task invokes this function at that time, it would get a empty list. It makes the enospc error happens more early. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolumeMiao Xie
If the snapshot creation happened after the nocow write but before the dirty data flush, we would fail to flush the dirty data because of no space. So we must keep track of when those nocow write operations start and when they end, if there are nocow writers, the snapshot creators must wait. In order to implement this function, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Add ftrace for btrfs_workqueueQu Wenruo
Add ftrace for btrfs_workqueue for further workqueue tunning. This patch needs to applied after the workqueue replace patchset. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Cleanup the btrfs_workqueue related function typeQu Wenruo
The new btrfs_workqueue still use open-coded function defition, this patch will change them into btrfs_func_t type which is much the same as kernel workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: add readahead for send_writeLiu Bo
Btrfs send reads data from disk and then writes to a stream via pipe or a file via flush. Currently we're going to read each page a time, so every page results in a disk read, which is not friendly to disks, esp. HDD. Given that, the performance can be gained by adding readahead for those pages. Here is a quick test: $ btrfs subvolume create send $ xfs_io -f -c "pwrite 0 1G" send/foobar $ btrfs subvolume snap -r send ro $ time "btrfs send ro -f /dev/null" w/o w real 1m37.527s 0m9.097s user 0m0.122s 0m0.086s sys 0m53.191s 0m12.857s Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: share the same code for __record_{new,deleted}_refLiu Bo
This has no functional change, only picks out the same part of two functions, and makes it shared. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: avoid unnecessary utimes update in incremental sendFilipe Manana
When we're finishing processing of an inode, if we're dealing with a directory inode that has a pending move/rename operation, we don't need to send a utimes update instruction to the send stream, as we'll do it later after doing the move/rename operation. Therefore we save some time here building paths and doing btree lookups. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: make defrag not fragment files when using prealloc extentsFilipe Manana
When using prealloc extents, a file defragment operation may actually fragment the file and increase the amount of data space used by the file. This change fixes that behaviour. Example: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt $ cd /mnt $ xfs_io -f -c 'falloc 0 1048576' foobar && sync $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync Before defragmenting the file: $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.25MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 0 nr 4096 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 4096 nr 102400 ram 1048576 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 106496 nr 90112 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 196608 nr 106496 ram 1048576 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 897024 nr 106496 ram 1048576 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 1003520 nr 45056 (...) Now defragmenting the file results in more data space used than before: $ btrfs filesystem defragment -f foobar && sync $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.55MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 And the corresponding file extent items are now no longer perfectly sequential as before, and we're now needlessly using more space from data block groups: $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 0 nr 4096 ram 1048576 extent compression 0 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 13893632 nr 102400 extent data offset 0 nr 102400 ram 102400 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 106496 nr 90112 ram 1048576 extent compression 0 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 13996032 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 14102528 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 1003520 nr 45056 ram 1048576 extent compression 0 (...) With this change, the above example will no longer cause allocation of new data space nor change the sequentiality of the file extents, that is, defragment will be effectless, leaving all extent items pointing to the extent starting at disk byte 12845056. In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of 400Mb each, initially consisting of a single prealloc extent of 400Mb, having random writes happening at a low rate, lead to a total of over ~17Gb of data space used, not far from eventually reaching an ENOSPC state. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: correctly flush data on defrag when compression is enabledFilipe Manana
When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression enabled, we weren't flushing completely, as writing compressed extents is a 2 steps process, one to compress the data and another one to write the compressed data to disk. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Cleanup the "_struct" suffix in btrfs_workequeueQu Wenruo
Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Cleanup the old btrfs_worker.Qu Wenruo
Since all the btrfs_worker is replaced with the newly created btrfs_workqueue, the old codes can be easily remove. Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->scrub_* with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->qgroup_rescan_worker with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->delayed_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->fixup_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->readahead_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->cache_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->rmw_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.Qu Wenruo
Replace the fs_info->endio_* workqueues with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->flush_workers with btrfs_workqueue.Qu Wenruo
Replace the fs_info->submit_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->submit_workers with btrfs_workqueue.Qu Wenruo
Much like the fs_info->workers, replace the fs_info->submit_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->delalloc_workers with btrfs_workqueueQu Wenruo
Much like the fs_info->workers, replace the fs_info->delalloc_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Replace fs_info->workers with btrfs_workqueue.Qu Wenruo
Use the newly created btrfs_workqueue_struct to replace the original fs_info->workers Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Add threshold workqueue based on kernel workqueueQu Wenruo
The original btrfs_workers has thresholding functions to dynamically create or destroy kthreads. Though there is no such function in kernel workqueue because the worker is not created manually, we can still use the workqueue_set_max_active to simulated the behavior, mainly to achieve a better HDD performance by setting a high threshold on submit_workers. (Sadly, no resource can be saved) So in this patch, extra workqueue pending counters are introduced to dynamically change the max active of each btrfs_workqueue_struct, hoping to restore the behavior of the original thresholding function. Also, workqueue_set_max_active use a mutex to protect workqueue_struct, which is not meant to be called too frequently, so a new interval mechanism is applied, that will only call workqueue_set_max_active after a count of work is queued. Hoping to balance both the random and sequence performance on HDD. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Add high priority workqueue support for btrfs_workqueue_structQu Wenruo
Add high priority function to btrfs_workqueue. This is implemented by embedding a btrfs_workqueue into a btrfs_workqueue and use some helper functions to differ the normal priority wq and high priority wq. So the high priority wq is completely independent from the normal workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>