summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/transaction.h
AgeCommit message (Collapse)Author
2009-03-24Btrfs: do extent allocation and reference count updates in the backgroundChris Mason
The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05Btrfs: Fix checkpatch.pl warningsChris Mason
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-11Btrfs: fix leaking block group on balanceYan Zheng
The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17Btrfs: Allow subvolumes and snapshots anywhere in the directory treeChris Mason
Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Record dirty pages tree-log pages in an extent_io treeChris Mason
This is the same way the transaction code makes sure that all the other tree blocks are safely on disk. There's an extent_io tree for each root, and any blocks allocated to the tree logs are recorded in that tree. At tree-log sync, the extent_io tree is walked to flush down the dirty pages and wait for them. The main benefit is less time spent walking the tree log and skipping clean pages, and getting sequential IO down to the drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a write ahead tree log to optimize synchronous operationsChris Mason
File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Various small fixes.Yan Zheng
This trivial patch contains two locking fixes and a off by one fix. --- Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: fix ioctl-initiated transactions vs wait_current_trans()Sage Weil
Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in progress) breaks the transaction start/stop ioctls by making btrfs_start_transaction conditionally wait for the next transaction to start. If an application artificially is holding a transaction open, things deadlock. This workaround maintains a count of open ioctl-initiated transactions in fs_info, and avoids wait_current_trans() if any are currently open (in start_transaction() and btrfs_throttle()). The start transaction ioctl uses a new btrfs_start_ioctl_transaction() that _does_ call wait_current_trans(), effectively pushing the join/wait decision to the outer ioctl-initiated transaction. This more or less neuters btrfs_throttle() when ioctl-initiated transactions are in use, but that seems like a pretty fundamental consequence of wrapping lots of write()'s in a transaction. Btrfs has no way to tell if the application considers a given operation as part of it's transaction. Obviously, if the transaction start/stop ioctls aren't being used, there is no effect on current behavior. Signed-off-by: Sage Weil <sage@newdream.net> --- ctree.h | 1 + ioctl.c | 12 +++++++++++- transaction.c | 18 +++++++++++++----- transaction.h | 2 ++ 4 files changed, 27 insertions(+), 6 deletions(-) Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Update and fix mount -o nodatacowYan Zheng
To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through dead root and checks all tree blocks in the path. We can easily detect whether a given tree block is directly referenced by other snapshot. We can also detect any indirect reference from other snapshot by checking reference's generation. The checker can always detect multiple references, but can't reliably detect cases of single reference. So btrfs may do file data cow even there is only one reference. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Throttle operations if the reference cache gets too largeChris Mason
A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_start_transaction: wait for commits in progress to finishChris Mason
btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: New data=ordered implementationChris Mason
The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Online btree defragmentation fixesChris Mason
The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Replace the transaction work queue with kthreadsChris Mason
This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add btrfs_end_transaction_throttle to force writers to wait for pending commitsChris Mason
The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Split the extent_map code into two partsChris Mason
There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_bufers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Move snapshot creation to commit timeChris Mason
It is very difficult to create a consistent snapshot of the btree when other writers may update the btree before the commit is done. This changes the snapshot creation to happen during the commit, while no other updates are possible. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add data=ordered supportChris Mason
This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Back port to 2.6.18-el kernelsChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Create extent_buffer interface for large blocksizesChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-11Btrfs: Find and remove dead roots the first time a root is loaded.Chris Mason
Dead roots are trees left over after a crash, and they were either in the process of being removed or were waiting to be removed when the box crashed. Before, a search of the entire tree of root pointers was done on mount looking for dead roots. Now, the search is done the first time we load a root. This makes mount faster when there are a large number of snapshots, and it enables the block accounting code to properly update the block counts on the latest root as old versions of the root are reaped after a crash. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10Btrfs: delay commits during fsync to allow more writersJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10Btrfs: Btree defrag on the extent-mapping tree as wellChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08Btrfs: Replace extent tree preallocation code with some bit radix magic.Chris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07Btrfs: Add run time btree defrag, and an ioctl to force btree defragChris Mason
This adds two types of btree defrag, a run time form that tries to defrag recently allocated blocks in the btree when they are still in ram, and an ioctl that forces defrag of all btree blocks. File data blocks are not defragged yet, but this can make a huge difference in sequential btree reads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22Btrfs: Add the ability to find and remove dead roots after a crash.Chris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12Btrfs: add GPLv2Chris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-08Btrfs: get forced transaction commits via workqueueChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-08Btrfs: add compat ioctlChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-30Btrfs: allocator improvements, inode block groupsChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-28Btrfs: smarter transaction writebackChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-19Btrfs: early fsync supportChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-02Btrfs: still corruption huntingChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-22Btrfs: transaction reworkChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-22Mountable btrfs, with readdirChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-16Btrfs: transaction handles everywhereChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>