summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2011-10-19Btrfs: use the transactions block_rsv for the csum rootJosef Bacik
The alloc warnings everybody has been seeing is because we have been reserving space for csums, but we weren't actually using that space. So make get_block_rsv() return the trans->block_rsv if we're modifying the csum root. Also set the trans->block_rsv to NULL so that if we modify the csum root when running delayed ref's that comes out of the global reserve like it's supposed to. With this patch I'm not seeing those alloc warnings anymore. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: handle enospc accounting for free space inodesJosef Bacik
Since free space inodes now use normal checksumming we need to make sure to account for their metadata use. So reserve metadata space, and then if we fail to write out the metadata we can just release it, otherwise it will be freed up when the io completes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: put the block group cache after we commit the superJosef Bacik
In moving some enospc stuff around I noticed that when we unmount we are often evicting the free space cache inodes before we do our last commit. This isn't bad, but it makes us constantly have to re-read the inodes back. So instead don't evict the cache until after we do our last commit, this will make things a little less crappy and makes a future enospc change work properly. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: set truncate block rsv's sizeJosef Bacik
While debugging a different issue I noticed that we were always reserving space when we tried to use our truncate block rsv's. This is because they didn't have a ->size value, so use_block_rsv just assumes there is nothing reserved and it does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't actually add to the size of the block rsv. That seems to be the right thing to do so set ->size to the minimum truncate size we need, since we will always only refill to that size anyway, and this way everything works out correctly. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: don't increase the block_rsv's size when emergency allocating spaceJosef Bacik
If we have to emergency reserve space we need to not increase the block_rsv size, otherwise we'll leak space. Take for instance delalloc, say we reserve 4k, and we use that 4k, and then we have to emergency allocate another 4k, we bump the size up to 8k, however we've only accounted for 4k in reservations in all of our supporting logic, so we'll go to free the 4k and end up having a size of 4k, which will cause us to later not free as much space. I saw this doing testing where I wasn't reserving enough space for something but was still leaking space, very frustrating. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix space leak when we fail to make an allocationJosef Bacik
When changing back to using a spin_lock to protect the extent counters I decided that since we would only be dropping our original extent, it was ok to just drop the extent and return. However since somebody else could have come in and done a reservation, we need to do the normal song and dance to clear the reservation out properly. So calculate how much space we need to free, and then subtract what we just attempted to reserve. If it's more then we know we need to drop those bytes from the delalloc block rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix call to btrfs_search_slot in free space cacheJosef Bacik
We are setting ins_len to 1 even tho we are just modifying an item that should be there already. This may cause the search stuff to split nodes on the way down needelessly. Set this to 0 since we aren't inserting anything. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_checkJosef Bacik
If you run xfstest 224 it you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in truncate and evict case we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: reduce the amount of space needed for truncatesJosef Bacik
With btrfs_truncate_inode_items we always return if we have to go to another leaf, which makes us do our reservation again. This means we will only ever modify one leaf at a time, so we only need 1 items worth of slack space. Also, since we are deleting we will not be creating nodes as we go down, if anything we'll be free'ing them as we merge them together, so make a different calculation for truncate which will only have the worst case useage of COW'ing the entire path down to the leaf. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: only reserve space in fallocate if we have to do a preallocateJosef Bacik
Lukas found a problem where if he tries to fallocate over the same region twice and the first fallocate took up all the space we would fail with ENOSPC. This is because we reserve the total space we want to use for fallocate, regardless of wether or not we will have to actually preallocate. So instead move the check into the loop where we actually have to do the preallocate. Thanks, Tested-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill btrfs_truncate_reserve_metadataJosef Bacik
Since we've optimized the truncate path, we no longer require this function. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: optimize how we account for space in truncateJosef Bacik
Currently we're starting and stopping a transaction for no real reason, so kill that and just reserve enough space as if we can truncate all in one transaction. Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space we may have to allocate for our slack space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: don't try to commit in btrfs_block_rsv_checkJosef Bacik
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot because we have a transaction open it will return EAGAIN, so we do not need to try and commit the transaction again. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill unused parts of block_rsvJosef Bacik
The priority and refill_used flags are not used anymore, and neither is the usage counter, so just remove them from btrfs_block_rsv. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: ratelimit the generation printk for the free space cacheJosef Bacik
A user reported getting spammed when moving to 3.0 by this message. Since we switched to the normal checksumming infrastructure all old free space caches will be wrong and need to be regenerated so people are likely to see this message a lot, so ratelimit it so it doesn't fill up their logs and freak them out. Thanks, Reported-by: Andrew Lutomirski <luto@mit.edu> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix how we reserve space for deleting inodesJosef Bacik
I converted btrfs_truncate to do sane reservations for truncate, but didn't convert btrfs_evict_inode. Basically we need to save the orphan_rsv for deleting the orphan item, and do normal reservations for our truncate. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill the durable block rsv stuffJosef Bacik
This is confusing code and isn't used by anything anymore, so delete it. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill the orphan space calculation for snapshotsJosef Bacik
This patch kills off the calculation for the amount of space needed for the orphan operations during a snapshot. The thing is we only do snapshots on commit, so any space that is in the block_rsv->freed[] isn't going to be in the new snapshot anyway, so there isn't any reason to require that space to be reserved for the snapshot to occur. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: calculate checksum space correctlyJosef Bacik
We have not been reserving enough space for checksums. We were just reserving bytes for the checksum items themselves, we were not taking into account having to cow the tree and such. This patch adds a csum_bytes counter to the inode for keeping track of the number of bytes outstanding we have for checksums. Then we calculate how many leaves would be required for the checksums we are given and use that to reserve space. This adds a significant amount of bytes to our reservations, but we will handle this later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: skip looking for delalloc if we don't have ->fill_delallocJosef Bacik
We always look for delalloc bytes in our io_tree so we can fill in delalloc. This is fine in most cases, but if we're writing out the btree_inode this is just a superfluous tree search on the io_tree, and if we have a lot of metadata dirty this could be an expensive check. So instead check to see if our io_tree has a ->fill_delalloc op, and if not don't even bother doing the lookup. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: use bytes_may_use for all ENOSPC reservationsJosef Bacik
We have been using bytes_reserved for metadata reservations, which is wrong since we use that to keep track of outstanding reservations from the allocator. This resulted in us doing a lot of silly things to make sure we don't allocate a bunch of metadata chunks since we never had a real view of how much space was actually in use by metadata. This passes Arne's enospc test and xfstests as well as my own enospc tests. Hopefully this will get us moving in the right direction. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix how we mount subvol=<whatever>Josef Bacik
We've only been able to mount with subvol=<whatever> where whatever was a subvol within whatever root we had as the default. This allows us to mount -o subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: use d_obtain_alias when mounting subvol/subvolidJosef Bacik
Currently what we do is just wrong. We either 1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong as we could walk into this subvol later on via another path and hilarity could ensue. Also we don't check the return value of d_splice_alias which isn't good either. or 2) Do a d_find_alias() which we could have lost our dentry from cache at this point and found nothing. So use d_obtain_alias(). In the case that we already have the inode/dentry in cache we will get the correct dentry. If not we will get a disconnected dentry tree so if we walk into it later on everything will be connected up properly. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill reserved_bytes in inodeJosef Bacik
reserved_bytes is not used for anything in the inode, remove it. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: move stuff around in btrfs_inode to get better packingJosef Bacik
Moving things around to give us better packing in the btrfs_inode. This reduces the size of our inode by 8 bytes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-13Merge branch 'btrfs-3.0' of git://github.com/chrismason/linuxLinus Torvalds
* 'btrfs-3.0' of git://github.com/chrismason/linux: Btrfs: make sure not to defrag extents past i_size Btrfs: fix recursive auto-defrag
2011-10-11Btrfs: make sure not to defrag extents past i_sizeChris Mason
The btrfs file defrag code will loop through the extents and force COW on them. But there is a concurrent truncate in the middle of the defrag, it might end up defragging the same range over and over again. The problem is that writepage won't go through and do anything on pages past i_size, so the cow won't happen, so the file will appear to still be fragmented. defrag will end up hitting the same extents again and again. In the worst case, the truncate can actually live lock with the defrag because the defrag keeps creating new ordered extents which the truncate code keeps waiting on. The fix here is to make defrag check for i_size inside the main loop, instead of just once before the looping starts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-10Btrfs: fix recursive auto-defragLi Zefan
Follow those steps: # mount -o autodefrag /dev/sda7 /mnt # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1 # sync # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc and then it'll go into a loop: writeback -> defrag -> writeback ... It's because writeback writes [8K, 200K] and then writes [0, 8K]. I tried to make writeback know if the pages are dirtied by defrag, but the patch was a bit intrusive. Here I simply set writeback_index when we defrag a file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-03Merge branch 'btrfs-3.0' of git://github.com/chrismason/linuxLinus Torvalds
* 'btrfs-3.0' of git://github.com/chrismason/linux: Btrfs: force a page fault if we have a shorty copy on a page boundary
2011-10-02btrfs: use readahead API for scrubArne Jansen
Scrub uses a simple tree-enumeration to bring the relevant portions of the extent- and csum-tree into the page cache before starting the scrub-I/O. This is now replaced by using the new readahead-API. During readahead the scrub is being accounted as paused, so it won't hold off transaction commits. This change raises the average disk bandwith utilisation on my test volume from 70% to 90%. On another volume, the time for a test run went down from 89s to 43s. Changes v5: - reada1/2 are now of type struct reada_control * Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: hooks for readaheadArne Jansen
This adds the hooks needed for readahead. In the readpage_end_io_hook, the extent state is checked for the EXTENT_READAHEAD flag. Only in this case the readahead hook is called, to keep the impact on non-ra as low as possible. Additionally, a hook for a failed IO is added, otherwise readahead would wait indefinitely for the extent to finish. Changes for v2: - eliminate race condition Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: initial readahead code and prototypesArne Jansen
This is the implementation for the generic read ahead framework. To trigger a readahead, btrfs_reada_add must be called. It will start a read ahead for the given range [start, end) on tree root. The returned handle can either be used to wait on the readahead to finish (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach). The read ahead works as follows: On btrfs_reada_add, the root of the tree is inserted into a radix_tree. reada_start_machine will then search for extents to prefetch and trigger some reads. When a read finishes for a node, all contained node/leaf pointers that lie in the given range will also be enqueued. The reads will be triggered in sequential order, thus giving a big win over a naive enumeration. It will also make use of multi-device layouts. Each disk will have its on read pointer and all disks will by utilized in parallel. Also will no two disks read both sides of a mirror simultaneously, as this would waste seeking capacity. Instead both disks will read different parts of the filesystem. Any number of readaheads can be started in parallel. The read order will be determined globally, i.e. 2 parallel readaheads will normally finish faster than the 2 started one after another. Changes v2: - protect root->node by transaction instead of node_lock - fix missed branches: The readahead had a too simple check to determine if a branch from a node should be checked or not. It now also records the upper bound of each node to see if the requested RA range lies within. - use KERN_CONT to debug output, to avoid line breaks - defer reada_start_machine to worker to avoid deadlock Changes v3: - protect root->node by rcu Changes v5: - changed EIO-semantics of reada_tree_block_flagged - remove spin_lock from reada_control and make elems an atomic_t - remove unused read_total from reada_control - kill reada_key_cmp, use btrfs_comp_cpu_keys instead - use kref-style release functions where possible - return struct reada_control * instead of void * from btrfs_reada_add Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: state information for readaheadArne Jansen
Add state information for readahead to btrfs_fs_info and btrfs_device Changes v2: - don't wait in radix_trees - add own set of workers for readahead Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: add READAHEAD extent buffer flagArne Jansen
Add a READAHEAD extent buffer flag. Add a function to trigger a read with this flag set. Changes v2: - use extent buffer flags instead of extent state flags Changes v5: - adapt to changed read_extent_buffer_pages interface - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: add an extra wait mode to read_extent_buffer_pagesArne Jansen
read_extent_buffer_pages currently has two modes, either trigger a read without waiting for anything, or wait for the I/O to finish. The former also bails when it's unable to lock the page. This patch now adds an additional parameter to allow it to block on page lock, but don't wait for completion. Changes v5: - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and WAIT_PAGE_LOCK Change v6: - fix bug introduced in v5 Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-09-30Merge branch 'btrfs-3.0' into for-linusChris Mason
2011-09-30Btrfs: force a page fault if we have a shorty copy on a page boundaryJosef Bacik
A user reported a problem where ceph was getting into 100% cpu usage while doing some writing. It turns out it's because we were doing a short write on a not uptodate page, which means we'd fall back at one page at a time and fault the page in. The problem is our position is on the page boundary, so our fault in logic wasn't actually reading the page, so we'd just spin forever or until the page got read in by somebody else. This will force a readpage if we end up doing a short copy. Alexandre could reproduce this easily with ceph and reports it fixes his problem. I also wrote a reproducer that no longer hangs my box with this patch. Thanks, Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-29btrfs: integrating raid-repair and scrub-fixup-nodatasumJan Schmidt
This ties nodatasum fixup in scrub together with raid repair patches. While both series are working fine alone, scrub will report uncorrectable errors if they occur in a nodatasum extent *and* the page is in the page cache. Previously, we would have triggered readpage to find good data and do the repair. However, readpage wouldn't read anything in the case where the page is up to date in the cache. So, we simply take that good data we have and call repair_io_failure directly (unless the page in the cache is dirty). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: Moved repair code from inode.c to extent_io.cJan Schmidt
The raid-retry code in inode.c can be generalized so that it works for metadata as well. Thus, this patch moves it to extent_io.c and makes the raid-retry code a raid-repair code. Repair works that way: Whenever a read error occurs and we have more mirrors to try, note the failed mirror, and retry another. If we find a good one, check if we did note a failure earlier and if so, do not allow the read to complete until after the bad sector was written with the good data we just fetched. As we have the extent locked while reading, no one can change the data in between. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: Put mirror_num in bi_bdevJan Schmidt
The error correction code wants to make sure that only the bad mirror is rewritten. Thus, we need to know which mirror is the bad one. I did not find a more apropriate field than bi_bdev. But I think using this is fine, because it is modified by the block layer, anyway, and should not be read after the bio returned. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: Do not use bio->bi_bdev after submissionJan Schmidt
The block layer modifies bio->bi_bdev and bio->bi_sector while working on the bio, they do _not_ come back unmodified in the completion callback. To call add_page, we need at least some bi_bdev set, which is why the code was working, previously. With this patch, we use the latest_bdev from fsinfo instead of the leftover in the bio. This gives us the possibility to use the bi_bdev field for another purpose. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: btrfs_multi_bio replaced with btrfs_bioJan Schmidt
btrfs_bio is a bio abstraction able to split and not complete after the last bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio tracks the mirror_num used to read data which can be used for error correction purposes. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: new ioctls to do logical->inode and inode->path resolvingJan Schmidt
these ioctls make use of the new functions initially added for scrub. they return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and all paths belonging to an inode (BTRFS_IOC_INO_PATHS). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs scrub: add fixup code for errors on nodatasum filesJan Schmidt
This removes a FIXME comment and introduces the first part of nodatasum fixup: It gets the corresponding inode for a logical address and triggers a regular readpage for the corrupted sector. Once we have on-the-fly error correction our error will be automatically corrected. The correction code is expected to clear the newly introduced EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead of "uncorrectable" eventually. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs scrub: use int for mirror_num, not u64Jan Schmidt
the rest of the code uses int mirror_num, and so should scrub Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: add mirror_num to extent_read_full_pageJan Schmidt
Currently, extent_read_full_page always assumes we are trying to read mirror 0, which generally is the best we can do. To add flexibility, pass it as a parameter. This will be needed by scrub fixup code. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs scrub: bugfix: mirror_num off by oneJan Schmidt
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code did not use mirror_num for anything important and that error went unnoticed. The nodatasum fixup patch of this set depends on a correct mirror_num. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs scrub: print paths of corrupted filesJan Schmidt
While scrubbing, we may encounter various errors. Previously, a logical address was printed to the log only. Now, all paths belonging to that address are resolved and printed separately. That should work for hardlinks as well as reflinks. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs scrub: added unverified_errorsJan Schmidt
In normal operation, scrub is reading data sequentially in large portions. In case of an i/o error, we try to find the corrupted area(s) by issuing page sized read requests. With this commit we increment the unverified_errors counter if all of the small size requests succeed. Userland patches carrying such conspicous events to the administrator should already be around. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29btrfs: added helper functions to iterate backrefsJan Schmidt
These helper functions iterate back references and call a function for each backref. There is also a function to resolve an inode to a path in the file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>