summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/tree-log.c
AgeCommit message (Collapse)Author
2011-08-16Btrfs: fix an oops of log replayliubo
When btrfs recovers from a crash, it may hit the oops below: ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:4580! [...] RIP: 0010:[<ffffffffa03df251>] [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs] [...] Call Trace: [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs] [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs] [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs] [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs] [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs] [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs] [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs] [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs] [...] This comes from that while replaying an inode ref item, we forget to check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree, then we will come to conflict corners which lead to BUG_ON(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Tested-by: Andy Lutomirski <luto@mit.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01Merge branch 'alloc_path' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus
2011-07-27Btrfs: switch the btrfs tree locks to reader/writerChris Mason
The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-14btrfs: Don't BUG_ON alloc_path errors in replay_one_buffer()Mark Fasheh
The two ->process_func call sites in tree-log.c which were ignoring a return code have also been updated to gracefully exit as well. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-06-17btrfs: fix dereference of ERR_PTR valueDavid Sterba
smatch reports: btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing possible ERR_PTR() Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Merge branch 'cleanups_and_fixes' into inode_numbersChris Mason
Conflicts: fs/btrfs/tree-log.c fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Btrfs: check return value of btrfs_inc_extent_ref()Tsutomu Itoh
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Btrfs: return error to caller if read_one_inode() failsTsutomu Itoh
When read_one_inode() fails, error code is returned to caller instead of BUG_ON(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & ↵Tsutomu Itoh
btrfs_extend_item Currently, btrfs_truncate_item and btrfs_extend_item returns only 0. So, the check by BUG_ON in the caller is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Btrfs: return error code to caller when btrfs_del_item failsTsutomu Itoh
The error code is returned instead of calling BUG_ON when btrfs_del_item returns the error. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Btrfs: do not flush csum items of unchanged file data during treelogliubo
The current code relogs the entire inode every time during fsync log, and it is much better suited to small files rather than large ones. During my performance test, the fsync performace of large files sucks, and we can ascribe this to the tremendous amount of csum infos of the large ones, cause we have to flush all of these csum infos into log trees even when there are only _one_ change in the whole file data. Apparently, to optimize fsync, we need to create a filter to skip the unnecessary csum ones, that is, the corresponding file data remains unchanged before this fsync. Here I have some test results to show, I use sysbench to do "random write + fsync". === sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run] === Sysbench args: - Number of threads: 1 - Extra file open flags: 0 - 2 files, 4Gb each - Block size 4Kb - Number of random requests for random IO: 10000 - Read/Write ratio for combined random IO test: 1.50 - Periodic FSYNC enabled, calling fsync() each 100 requests. - Calling fsync() at the end of test, Enabled. - Using synchronous I/O mode - Doing random write test Sysbench results: === Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total Read 0b Written 39.062Mb Total transferred 39.062Mb === a) without patch: (*SPEED* : 451.01Kb/sec) 112.75 Requests/sec executed b) with patch: (*SPEED* : 4.7533Mb/sec) 1216.84 Requests/sec executed PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch, and I'm wanderring where the problem is and trying to improve it more. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23Merge branch 'for-chris' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers Conflicts: fs/btrfs/Makefile fs/btrfs/ctree.h fs/btrfs/volumes.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into ↵Chris Mason
inode_numbers Conflicts: fs/btrfs/extent-tree.c fs/btrfs/free-space-cache.c fs/btrfs/inode.c fs/btrfs/tree-log.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22Merge branch 'delayed_inode' into inode_numbersChris Mason
Conflicts: fs/btrfs/inode.c fs/btrfs/ioctl.c fs/btrfs/transaction.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21btrfs: implement delayed inode items operationMiao Xie
Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into ↵Chris Mason
inode_numbers Conflicts: fs/btrfs/free-space-cache.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-12btrfs: scrubArne Jansen
This adds an initial implementation for scrub. It works quite straightforward. The usermode issues an ioctl for each device in the fs. For each device, it enumerates the allocated device chunks. For each chunk, the contained extents are enumerated and the data checksums fetched. The extents are read sequentially and the checksums verified. If an error occurs (checksum or EIO), a good copy is searched for. If one is found, the bad copy will be rewritten. All enumerations happen from the commit roots. During a transaction commit, the scrubs get paused and afterwards continue from the new roots. This commit is based on the series originally posted to linux-btrfs with some improvements that resulted from comments from David Sterba, Ilya Dryomov and Jan Schmidt. Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-02btrfs: drop unused parameter from btrfs_release_pathDavid Sterba
parameter tree root it's not used since commit 5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer interface for large blocksizes") Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02btrfs: unify checking of IS_ERR and nullDavid Sterba
use IS_ERR_OR_NULL when possible, done by this coccinelle script: @ match @ identifier id; @@ ( - BUG_ON(IS_ERR(id) || !id); + BUG_ON(IS_ERR_OR_NULL(id)); | - IS_ERR(id) || !id + IS_ERR_OR_NULL(id) | - !id || IS_ERR(id) + IS_ERR_OR_NULL(id) ) Signed-off-by: David Sterba <dsterba@suse.cz>
2011-04-25Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()Tsutomu Itoh
It is necessary to unlock mutex_lock before it return an error when btrfs_alloc_path() fails. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25Btrfs: Always use 64bit inode numberLi Zefan
There's a potential problem in 32bit system when we exhaust 32bit inode numbers and start to allocate big inode numbers, because btrfs uses inode->i_ino in many places. So here we always use BTRFS_I(inode)->location.objectid, which is an u64 variable. There are 2 exceptions that BTRFS_I(inode)->location.objectid != inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2), and inode->i_ino will be used in those cases. Another reason to make this change is I'm going to use a special inode to save free ino cache, and the inode number must be > (u64)-256. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-03-28btrfs: make inode ref log recovery fasterliubo
When we recover from crash via write-ahead log tree and process the inode refs, for each btrfs_inode_ref item, we will 1) check if we already have a perfect match in fs/file tree, if we have, then we're done. 2) search the corresponding back reference in fs/file tree, and check all the names in this back reference to see if they are also in the log to avoid conflict corners. 3) recover the logged inode refs to fs/file tree. In current btrfs, however, - for 2)'s check, once is enough, since the checked back reference will remain unchanged after processing all the inode refs belonged to the key. - it has no need to do another 1) between 2) and 3). I've made a small test to show how it improves, $dd if=/dev/zero of=foobar bs=4K count=1 $sync $make 100 hard links continuously, like ln foobar link_i $fsync foobar $echo b > /proc/sysrq-trigger after reboot $time mount DEV PATH without patch: real 0m0.285s user 0m0.001s sys 0m0.009s with patch: real 0m0.123s user 0m0.000s sys 0m0.010s Changelog v1->v2: - fix double free - pointed by David Sterba Changelog v2->v3: - adjust free order Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: cleanup some BUG_ON()Tsutomu Itoh
This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-17Btrfs: add checks to verify dir items are correctJosef Bacik
We need to make sure the dir items we get are valid dir items. So any time we try and read one check it with verify_dir_item, which will do various sanity checks to make sure it looks sane. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-02-01btrfs: fix return value check of btrfs_start_transaction()Tsutomu Itoh
The error check of btrfs_start_transaction() is added, and the mistake of the error check on several places is corrected. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01btrfs: checking NULL or not in some functionsTsutomu Itoh
Because NULL is returned when the memory allocation fails, it is checked whether it is NULL. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-31Btrfs: catch errors from btrfs_sync_logChris Mason
btrfs_sync_log returns -EAGAIN when we need full transaction commits instead of small log commits, but sometimes we were dropping the return value. In practice, we check for this a few different ways, but this is still a bug that can leave off full log commits when we really need them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28btrfs: fix several uncheck memory allocationsliubo
To make btrfs more stable, add several missing necessary memory allocation checks, and when no memory, return proper errno. We've checked that some of those -ENOMEM errors will be returned to userspace, and some will be catched by BUG_ON() in the upper callers, and none will be ignored silently. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21Btrfs: use dget_parent where we can UPDATEDJosef Bacik
There are lots of places where we do dentry->d_parent->d_inode without holding the dentry->d_lock. This could cause problems with rename. So instead we need to use dget_parent() and hold the reference to the parent as long as we are going to use it's inode and then dput it at the end. Signed-off-by: Josef Bacik <josef@redhat.com> Cc: raven@themaw.net Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29Btrfs: cleanup warnings from gcc 4.6 (nonbugs)Andi Kleen
These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29Btrfs: Fix variables set but not read (bugs found by gcc 4.6)Andi Kleen
These are all the cases where a variable is set, but not read which are really bugs. - Couple of incorrect error handling fixed. - One incorrect use of a allocation policy - Some other things Still needs more review. Found by gcc 4.6's new warnings. [akpm@linux-foundation.org: fix build. Might have been bitrot] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Metadata ENOSPC handling for tree logYan, Zheng
Previous patches make the allocater return -ENOSPC if there is no unreserved free metadata space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-15Btrfs: change how we mount subvolumesJosef Bacik
This work is in preperation for being able to set a different root as the default mounting root. There is currently a problem with how we mount subvolumes. We cannot currently mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the default subvolume. So say you take a snapshot of the default subvolume and call it snap1, and then take a snapshot of snap1 and call it snap2, so now you have / /snap1 /snap1/snap2 as your available volumes. Currently you can only mount / and /snap1, you cannot mount /snap1/snap2. To fix this problem instead of passing subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is the tree id that gets spit out via the subvolume listing you get from the subvolume listing patches (btrfs filesystem list). This allows us to mount /, /snap1 and /snap1/snap2 as the root volume. In addition to the above, we also now read the default dir item in the tree root to get the root key that it points to. For now this just points at what has always been the default subvolme, but later on I plan to change it to point at whatever root you want to be the new default root, so you can just set the default mount and not have to mount with -o subvolid=<treeid>. I tested this out with the above scenario and it worked perfectly. Thanks, mount -o subvol operates inside the selected subvolid. For example: mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt /mnt will have the snap1 directory for the subvolume with id 256. mount -o subvol=snap /dev/xxx /mnt /mnt will be the snap directory of whatever the default subvolume is. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17Btrfs: Avoid orphan inodes cleanup while replaying logYan, Zheng
We do log replay in a single transaction, so it's not good to do unbound operations. This patch cleans up orphan inodes cleanup after replaying the log. It also avoids doing other unbound operations such as truncating a file during replaying log. These unbound operations are postponed to the orphan inode cleanup stage. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15Btrfs: Rewrite btrfs_drop_extentsYan, Zheng
Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can avoid calling lock_extent within transaction. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15Btrfs: Avoid superfluous tree-log writeoutYan, Zheng
We allow two log transactions at a time, but use same flag to mark dirty tree-log btree blocks. So we may flush dirty blocks belonging to newer log transaction when committing a log transaction. This patch fixes the issue by using two flags to mark dirty tree-log btree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-15Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: always pin metadata in discard mode Btrfs: enable discard support Btrfs: add -o discard option Btrfs: properly wait log writers during log sync Btrfs: fix possible ENOSPC problems with truncate Btrfs: fix btrfs acl #ifdef checks Btrfs: streamline tree-log btree block writeout Btrfs: avoid tree log commit when there are no changes Btrfs: only write one super copy during fsync
2009-10-14Btrfs: properly wait log writers during log syncYan, Zheng
A recently fsync optimization make btrfs_sync_log skip calling wait_for_writer in the single log writer case. This is incorrect since the writer count can also be increased by btrfs_pin_log. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13Btrfs: streamline tree-log btree block writeoutChris Mason
Syncing the tree log is a 3 phase operation. 1) write and wait for all the tree log blocks for a given root. 2) write and wait for all the tree log blocks for the tree of tree log roots. 3) write and wait for the super blocks (barriers here) This isn't as efficient as it could be because there is no requirement to wait for the blocks from step one to hit the disk before we start writing the blocks from step two. This commit changes the sequence so that we don't start waiting until all the tree blocks from both steps one and two have been sent to disk. We do this by breaking up btrfs_write_wait_marked_extents into two functions, which is trivial because it was already broken up into two parts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13Btrfs: avoid tree log commit when there are no changesChris Mason
rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-13Btrfs: only write one super copy during fsyncChris Mason
During a tree-log commit for fsync, we've been writing at least two copies of the super block and forcing them to disk. The other filesystems write only one, and this change brings us on par with them. A full transaction commit will write all the super copies, so we still have redundant info written on a regular basis. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix file clone ioctl for bookend extents Btrfs: fix uninit compiler warning in cow_file_range_nocow Btrfs: constify dentry_operations Btrfs: optimize back reference update during btrfs_drop_snapshot Btrfs: remove negative dentry when deleting subvolumne Btrfs: optimize fsync for the single writer case Btrfs: async delalloc flushing under space pressure Btrfs: release delalloc reservations on extent item insertion Btrfs: delay clearing EXTENT_DELALLOC for compressed extents Btrfs: cleanup extent_clear_unlock_delalloc flags Btrfs: fix possible softlockup in the allocator Btrfs: fix deadlock on async thread startup
2009-10-08Btrfs: optimize fsync for the single writer caseJosef Bacik
This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie for new people to join the logging transaction if there is only ever 1 writer. This helps a little bit with latency where we have something like RPM where it will fdatasync every file it writes, and so waiting the 1 jiffie for every fdatasync really starts to add up. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24Merge branch 'master' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus Conflicts: fs/btrfs/super.c
2009-09-21Btrfs: add snapshot/subvolume destroy ioctlYan, Zheng
This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21Btrfs: do not reuse objectid of deleted snapshot/subvolYan, Zheng
The new back reference format does not allow reusing objectid of deleted snapshot/subvol. So we use ++highest_objectid to allocate objectid for new snapshot/subvol. Now we use ++highest_objectid to allocate objectid for both new inode and new snapshot/subvolume, so this patch removes 'find hole' code in btrfs_find_free_objectid. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21trivial: remove unnecessary semicolonsJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-17Btrfs: improve async block group cachingYan Zheng
This patch gets rid of two limitations of async block group caching. The old code delays handling pinned extents when block group is in caching. To allocate logged file extents, the old code need wait until block group is fully cached. To get rid of the limitations, This patch introduces a data structure to track the progress of caching. Base on the caching progress, we know which extents should be added to the free space cache when handling the pinned extents. The logged file extents are also handled in a similar way. This patch also changes how pinned extents are tracked. The old code uses one tree to track pinned extents, and copy the pinned extents tree at transaction commit time. This patch makes it use two trees to track pinned extents. One tree for extents that are pinned in the running transaction, one tree for extents that can be unpinned. At transaction commit time, we swap the two trees. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: Fix extent replacment raceChris Mason
Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>