path: root/fs/btrfs/extent-tree.c
Age | Commit message | Author
2011-11-22 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove free-space-cache.c WARN during log replay
  Btrfs: sectorsize align offsets in fiemap
  Btrfs: clear pages dirty for io and set them extent mapped
  Btrfs: wait on caching if we're loading the free space cache
  Btrfs: prefix resize related printks with btrfs:
  btrfs: fix stat blocks accounting
  Btrfs: avoid unnecessary bitmap search for cluster setup
  Btrfs: fix to search one more bitmap for cluster setup
  btrfs: mirror_num should be int, not u64
  btrfs: Fix up 32/64-bit compatibility for new ioctls
  Btrfs: fix barrier flushes
  Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
2011-11-20 | Btrfs: wait on caching if we're loading the free space cache | Josef Bacik
We've been hitting panics when running xfstest 13 in a loop for long periods of time. And actually this problem has always existed so we've been hitting these things randomly for a while. Basically what happens is we get a thread coming into the allocator and reading the space cache off of disk and adding the entries to the free space cache as we go. Then we get another thread that comes in and tries to allocate from that block group. Since block_group->cached != BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because if we're doing the old slow way of caching we don't want to hold people up and wait for everything to finish. The problem with this is we could end up discarding the space cache at some arbitrary point in the future, which means we could very well end up allocating space that is either bad, or when the real caching happens it could end up thinking the space isn't in use when it really is and cause all sorts of other problems. The solution is to add a new flag to indicate we are loading the free space cache from disk, and always try to cache the block group if cache->cached != BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else who tries to allocate from the block group will have to wait until it's finished to make sure it completes successfully. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
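The rule change described above can be modeled in a few lines of userspace C. This is only a sketch, not the kernel code: the struct, the intermediate `CACHE_STARTED` state, the `loading_from_disk` field, and all function names are assumptions made for illustration. Only `BTRFS_CACHE_NO` and `BTRFS_CACHE_FINISHED` are named in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's enum or struct layout. */
enum cache_state {
    CACHE_NO,        /* models BTRFS_CACHE_NO */
    CACHE_STARTED,   /* assumed intermediate state */
    CACHE_FINISHED   /* models BTRFS_CACHE_FINISHED */
};

struct block_group_model {
    enum cache_state cached;
    bool loading_from_disk;  /* stands in for the new "loading" flag */
};

/* Old behaviour: allocate as soon as caching is anything other than NO. */
bool old_may_allocate(const struct block_group_model *bg)
{
    return bg->cached != CACHE_NO;
}

/* New behaviour, step 1: always try to cache unless caching has finished. */
bool needs_caching(const struct block_group_model *bg)
{
    return bg->cached != CACHE_FINISHED;
}

/* New behaviour, step 2: an allocator waits while the on-disk load runs. */
bool must_wait_for_load(const struct block_group_model *bg)
{
    return needs_caching(bg) && bg->loading_from_disk;
}
```

With a block group in the started state and the load flag set, the old rule would already allocate, while the modeled new rule reports that the caller has to wait.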
2011-11-11 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: rename the option to nospace_cache
  Btrfs: handle bio_add_page failure gracefully in scrub
  Btrfs: fix deadlock caused by the race between relocation
  Btrfs: only map pages if we know we need them when reading the space cache
  Btrfs: fix orphan backref nodes
  Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
  Btrfs: fix unreleased path in btrfs_orphan_cleanup()
  Btrfs: fix no reserved space for writing out inode cache
  Btrfs: fix nocow when deleting the item
  Btrfs: tweak the delayed inode reservations again
  Btrfs: rework error handling in btrfs_mount()
  Btrfs: close devices on all error paths in open_ctree()
  Btrfs: avoid null dereference and leaks when bailing from open_ctree()
  Btrfs: fix subvol_name leak on error in btrfs_mount()
  Btrfs: fix memory leak in btrfs_parse_early_options()
  Btrfs: fix our reservations for updating an inode when completing io
  Btrfs: fix oops on NULL trans handle in btrfs_truncate
  btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
2011-11-10 | Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush} | Miao Xie
btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 | Btrfs: fix our reservations for updating an inode when completing io | Josef Bacik
People have been reporting ENOSPC crashes in finish_ordered_io. This is because we try to steal from the delalloc block rsv to satisfy a reservation to update the inode. The problem with this is we don't explicitly save space for updating the inode when doing delalloc. This is kind of a problem and we've gotten away with this because way back when we just stole from the delalloc reserve without any questions, and this worked out fine because generally speaking the leaf had been modified either by the mtime update when we did the original write or because we just updated the leaf when we inserted the file extent item, only on rare occasions had the leaf not actually been modified, and that was still ok because we'd just use a block or two out of the over-reservation that is delalloc. Then came the delayed inode stuff. This is amazing, except it wants a full reservation for updating the inode since it may do it at some point down the road after we've written the blocks and we have to recow everything again. This worked out because the delayed inode stuff just stole from the global reserve, that is until recently when I changed that because it caused other problems. So here we are, we're doing everything right and being screwed for it. So take an extra reservation for the inode at delalloc reservation time and carry it through the life of the delalloc reservation. If we need it we can steal it in the delayed inode stuff. If we have already stolen it try and do a normal metadata reservation. If that fails try to steal from the delalloc reservation. If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to solve this and in the meantime we'll steal from the global reserve. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see any problems. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
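The fallback order in the last paragraph of this message can be written down as a small decision function. This is a sketch under assumptions: the enum, the struct of boolean outcomes, and the function name are hypothetical, and each boolean stands in for whether the corresponding step would succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for where the inode-update space comes from. */
enum inode_rsv_source {
    SRC_EXTRA_INODE_RSV,   /* extra reservation taken at delalloc time */
    SRC_NORMAL_METADATA,   /* a normal metadata reservation */
    SRC_DELALLOC_STEAL,    /* stolen from the delalloc reservation */
    SRC_GLOBAL_WITH_WARN   /* WARN_ON(), then steal from the global reserve */
};

/* Each flag models whether that step would succeed; all are assumptions. */
struct rsv_state_model {
    bool extra_rsv_available;   /* false once it has already been stolen */
    bool metadata_reserve_ok;
    bool delalloc_steal_ok;
};

/* Walk the fallbacks in the order the commit message lists them. */
enum inode_rsv_source pick_inode_rsv_source(const struct rsv_state_model *s)
{
    if (s->extra_rsv_available)
        return SRC_EXTRA_INODE_RSV;
    if (s->metadata_reserve_ok)
        return SRC_NORMAL_METADATA;
    if (s->delalloc_steal_ok)
        return SRC_DELALLOC_STEAL;
    return SRC_GLOBAL_WITH_WARN;
}
```

The point of the ordering is that the global reserve is only touched when every cheaper source has failed, which is also the case that triggers the warning.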
2011-11-06 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
  Btrfs: check for a null fs root when writing to the backup root log
  Btrfs: fix race during transaction joins
  Btrfs: fix a potential btrfs_bio leak on scrub fixups
  Btrfs: rename btrfs_bio multi -> bbio for consistency
  Btrfs: stop leaking btrfs_bios on readahead
  Btrfs: stop the readahead threads on failed mount
  Btrfs: fix extent_buffer leak in the metadata IO error handling
  Btrfs: fix the new inspection ioctls for 32 bit compat
  Btrfs: fix delayed insertion reservation
  Btrfs: ClearPageError during writepage and clean_tree_block
  Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
  Btrfs: make a delayed_block_rsv for the delayed item insertion
  Btrfs: add a log of past tree roots
  btrfs: separate superblock items out of fs_info
  Btrfs: use the global reserve when truncating the free space cache inode
  Btrfs: release metadata from global reserve if we have to fallback for unlink
  Btrfs: make sure to flush queued bios if write_cache_pages waits
  Btrfs: fix extent pinning bugs in the tree log
  Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
  Btrfs: don't wait as long for more batches during SSD log commit
  ...
2011-11-06 | Merge git://git.jan-o-sch.net/btrfs-unstable into integration | Chris Mason
Conflicts: fs/btrfs/Makefile fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 | Btrfs: fix delayed insertion reservation | Josef Bacik
We all keep getting those stupid warnings from use_block_rsv when running stress.sh, and it's because the delayed insertion stuff is being stupid. It's not the delayed insertion stuff's fault, it's all just stupid. When marking an inode dirty for oh say updating the time on it, we just do a btrfs_join_transaction, which doesn't reserve any space. This is stupid because we're going to have to have space reserved to make this change, but we do it because it's fast because chances are we're going to call it over and over again and it doesn't matter. Well thanks to the delayed insertion stuff this is mostly the case, so we do actually need to make this reservation. So if trans->bytes_reserved is 0 then try to do a normal reservation. If not, return ENOSPC, which will make btrfs_dirty_inode start a proper transaction, which will let it do the whole ENOSPC dance and reserve enough space for the delayed insertion to steal the reservation from the transaction. The other stupid thing we do is not reserve space for the inode when writing to the thing. Usually this is ok since we have to update the time so we'd have already done all this work before we get to the endio stuff, so it doesn't matter. But this is stupid because we could write the data after the transaction commits where we changed the mtime of the inode so we have to cow all the way down to the inode anyway. This used to be masked by the delalloc reservation stuff, but because we delay the update it doesn't get masked in this case. So again the delayed insertion stuff bites us in the ass. So if our trans->block_rsv is delalloc, just steal the reservation from the delalloc reserve. Hopefully this won't bite us in the ass, but I've said that before. With this patch stress.sh no longer spits out those stupid warnings (famous last words). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 | Btrfs: be smarter about committing the transaction in reserve_metadata_bytes | Josef Bacik
Because of the overcommit stuff I had to make it so that we committed the transaction all the time in reserve_metadata_bytes in case we had overcommitted because of delayed items. This was because previously we had no way of knowing how much space was reserved for delayed items. Now that we have the delayed_block_rsv we can check it to see if committing the transaction would get us anywhere. This patch breaks out the committing logic into a helper function that will check to see if committing the transaction would free enough space for us to get anything done. With this patch xfstests 83 goes from taking 445 seconds to taking 28 seconds on my box. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
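The idea of the helper can be sketched as a pure function. This is a minimal model, not the kernel helper: the function name and parameters are hypothetical, and it assumes that a commit gives back both the pinned bytes and whatever is held for delayed items, which is how the message reads but is not spelled out there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: is a transaction commit worth attempting?
 * Assumes committing releases pinned bytes plus the space reserved
 * for delayed items (the delayed_block_rsv in the commit message).
 */
bool commit_would_help(uint64_t bytes_pinned,
                       uint64_t delayed_rsv_bytes,
                       uint64_t bytes_needed)
{
    /* Pinned space alone already covers the request. */
    if (bytes_pinned >= bytes_needed)
        return true;

    /* Otherwise the delayed reservations must cover the shortfall. */
    return delayed_rsv_bytes >= bytes_needed - bytes_pinned;
}
```

Returning false here is what lets the caller skip a pointless commit, which is where the reported speedup on xfstests 83 would come from.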
2011-11-06 | Btrfs: make a delayed_block_rsv for the delayed item insertion | Josef Bacik
I've been hitting warnings in use_block_rsv when running the delayed insertion stuff. It's because we will readjust global block rsv based on what is in use, which means we could end up discarding reservations that are for the delayed insertion stuff. So instead create a separate block rsv for the delayed insertion stuff. This will also make it easier to debug problems with the delayed insertion reservations since we will know that only the delayed insertion code touches this block_rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 | btrfs: separate superblock items out of fs_info | David Sterba
fs_info is now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of its dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 | Btrfs: fix extent pinning bugs in the tree log | Chris Mason
The tree log had two important bugs that could cause corruptions after a crash. Sometimes we were allowing tree log blocks to be reused after the tree log was committed but before the transaction commit was done. This allowed a future metadata write to overwrite the tree log data. It is fixed by adding a new variant of freeing reserved extents that always pins them. Credit goes to Stefan Behrens and Arne Jansen for many many hours spent tracking this bug down. During tree log replay, we do a pass through the tree log and pin all the extents we find. This makes sure the replay code won't go in and use any of those blocks for new allocations during replay. The problem is the free space cache isn't honoring these pinned extents. So the allocator can end up handing them out, leading to all kinds of problems during replay. The fix here is to force any free space cache to load while we pin the extents, and then to make sure we remove the pinned extents from the free space rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-10-31 | writeback: Add a 'reason' to wb_writeback_work | Curt Wohlgemuth
This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-24 | btrfs: ratelimit WARN_ON in use_block_rsv | David Sterba
Under some circumstances the WARN_ON heavily pollutes the log and slows down the machine. This is just a safety measure: the warning should be fixed by another patch, but it still pops up during testing. Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-20 | Btrfs: fix race between multi-task space allocation and caching space | Miao Xie
A task may fail to get free space, even though there is enough, when multi-task space allocation and space caching happen at the same time.
	Task1			Caching Thread		Task2
	------------------------------------------------------------------------
	find_free_extent
	  The space has not
	  been cached, and
	  start caching
	  thread. And wait
	  for it.
				cache space, if
				the space is > 2MB
				wake up Task1
							find_free_extent
							  get all the space that
							  is cached.
	try to allocate space,
	but there is no space
	now.
	  trigger BUG_ON()
The message is the following:
	btrfs allocation failed flags 1, wanted 4096
	space_info has 1040187392 free, is not full
	space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
	block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
	block group has cluster?: no
	0 blocks of free space at or bigger than bytes is
	block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
	block group has cluster?: no
	0 blocks of free space at or bigger than bytes is
	------------[ cut here ]------------
	kernel BUG at fs/btrfs/inode.c:835!
	 [<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
	 [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
	 [<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
	 [<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
	 [<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
	 [<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
	 [<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
	 [<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
	 [<ffffffff811380fa>] ? fsnotify+0x236/0x266
	 [<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
	 [<ffffffff810cc215>] do_writepages+0x1c/0x25
	 [<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
	 [<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
	 [<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
	 [<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
	 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
	 [<ffffffff8112d170>] vfs_fsync+0x17/0x19
	 [<ffffffff8112d316>] do_fsync+0x29/0x3e
	 [<ffffffff8112d348>] sys_fsync+0xb/0xf
	 [<ffffffff81468352>] system_call_fastpath+0x16/0x1b
	[SNIP]
	RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-10-20 | Btrfs: pass the correct root to lookup_free_space_inode() | Ilya Dryomov
Free space items are located in the tree of tree roots, not in the extent tree. It didn't pop up because lookup_free_space_inode() grabs the inode all the time instead of actually searching the tree. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-19 | Btrfs: if we have a lot of pinned space, commit the transaction | Josef Bacik
Mitch kept hitting a panic because he was getting ENOSPC. One of my previous patches makes it so we are much better at not allocating new metadata chunks. Unfortunately coupled with the overcommit patch this works us into a bit of a problem if we are removing a bunch of space and end up chewing up all of our space with pinned extents. We can allocate chunks fine and overflow is ok, but the only way to reclaim this space is to commit the transaction. So if we go to overcommit, first check and see how much pinned space we have. If we have more than 80% of the free space chewed up with pinned extents, just commit the transaction, this will free up enough space for our reservation and we won't have this problem anymore. With this patch Mitch's test doesn't blow up anymore. Thanks, Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Josef Bacik <josef@redhat.com>
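The 80% threshold can be checked with integer arithmetic only. This is a minimal sketch: the function name is hypothetical, and what exactly counts as "free space" in the kernel is not stated in the message, so `free_bytes` here is simply whatever quantity the caller compares against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical threshold test: true when pinned extents take up more
 * than 80% of the free space. Cross-multiplying (pinned * 5 > free * 4)
 * avoids division; it assumes both values stay below 2^61 so the
 * products cannot overflow a uint64_t.
 */
bool pinned_exceeds_80_percent(uint64_t pinned_bytes, uint64_t free_bytes)
{
    return pinned_bytes * 5 > free_bytes * 4;
}
```

A caller about to overcommit would commit the transaction first whenever this returns true, since committing is the only way to reclaim pinned space.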
2011-10-19 | Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions | Josef Bacik
Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So break this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: release trans metadata bytes before flushing delayed refs | Josef Bacik
We started setting trans->block_rsv = NULL to allow the delayed refs flushing stuff to use the right block_rsv and then just made btrfs_trans_release_metadata() unconditionally use the trans block rsv. The problem with this is we need to reserve some space in the transaction and then migrate it to the global block rsv, so we need to be able to free that out properly. So instead just move btrfs_trans_release_metadata() before the delayed ref flushing and use trans->block_rsv for the freeing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: allow shrink_delalloc flush the needed reclaimed pages | Josef Bacik
Currently we only allow a maximum of 2 megabytes of pages to be flushed at a time. This was ok before, but now we have overcommit which will screw us in a heartbeat if we are quickly filling the disk. So instead pick either 2 megabytes or the number of pages we need to reclaim to be safe again, whichever is larger. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
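The new sizing rule is a maximum of two quantities. A minimal sketch, assuming a caller that knows how many bytes it has to reclaim and the page size; the function name, the macro, and both parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The 2 megabyte floor mentioned in the commit message. */
#define MODEL_MIN_FLUSH_BYTES (2ULL * 1024 * 1024)

/*
 * Hypothetical helper: number of pages to flush, taking the larger of
 * the 2 MiB floor and the bytes we need to reclaim, rounded up to
 * whole pages. Returns 0 for a zero page size instead of dividing.
 */
uint64_t pages_to_flush(uint64_t bytes_to_reclaim, uint64_t page_size)
{
    uint64_t bytes = bytes_to_reclaim > MODEL_MIN_FLUSH_BYTES
                         ? bytes_to_reclaim
                         : MODEL_MIN_FLUSH_BYTES;

    if (page_size == 0)
        return 0;
    return (bytes + page_size - 1) / page_size;
}
```

Small requests still flush the 2 MiB floor, while a large shortfall now scales the flush instead of being capped.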
2011-10-19 | Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc | Josef Bacik
The only way we actually reclaim delalloc space is waiting for the IO to completely finish. Usually we kick off a bunch of IO and wait for a little bit and hope we can make our reservation, and usually this works out pretty well. With overcommit however we can get seriously underwater if we're filling up the disk quickly, so we need to be able to force the delalloc shrinker to wait for the ordered IO to finish to give us a better chance of actually reclaiming enough space to get our reservation. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: don't check bytes_pinned to determine if we should commit the transaction | Josef Bacik
Before the only reason to commit the transaction to recover space in reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our reservation. But now we have the delayed inode stuff which will hold its reservations until we commit the transaction. So say we max out our reservation by creating a bunch of files but don't have any pinned bytes we will ENOSPC out early even though we could commit the transaction and get that space back. So now just unconditionally commit the transaction since currently there is no way to know how much metadata space is being reserved by delayed inode stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: wait for ordered extents if we didn't reclaim enough | Josef Bacik
I noticed recently that my overcommit patch was causing one of my enospc tests to fail 25% of the time with early ENOSPC. This is because my overcommit patch was letting us go way overboard, but it wasn't waiting long enough to let the delalloc shrinker do its job. The problem is we just start writeback and wait a little bit hoping we flush enough, but we only free up delalloc space by having the writes complete all the way. We do this by waiting for ordered extents, which we do but only if we already free'd enough for the reservation, which isn't right, we should flush ordered extents if we didn't reclaim enough in case that will push us over the edge. With this patch I've not seen a failure in this enospc test after running it in a loop for an hour. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: inline checksums into the disk free space cache | Josef Bacik
Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of its checksums and re-write it which will write new checksums which could dirty a block group that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take its sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: take overflow into account in reserving space | Josef Bacik
My overcommit stuff can be a little racy when we're filling up the disk with fs_mark and we overcommit into things that quickly get used up for data. So use num_bytes to see if we have enough available space so we're less likely to overcommit ourselves out of the ability to make reservations. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: introduce mount option no_space_cache | Josef Bacik
Some users have requested this and I've found I needed a way to disable cache loading without actually clearing the cache, so introduce the no_space_cache option. Before, we checked the super block's cache generation field and, if it was populated, we always turned space caching on. Now we check this and set the space cache option on, and then parse the mount options so that if we want it off it gets turned off. Then we check the mount option in all the places we do the caching work instead of checking the super's cache generation. This makes things more consistent and lets us turn space caching off. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: allow us to overcommit our enospc reservations | Josef Bacik
One of the things that kills us is the fact that our ENOSPC reservations are horribly over the top in most normal cases. There isn't too much that can be done about this because when we are completely full we really need them to work like this so we don't under reserve. However if there are plenty of unallocated chunks on the disk we can use that to gauge how much we can overcommit. So this patch adds chunk free space accounting so we always know how much unallocated space we have. Then if we fail to make a reservation within our allocated space, check to see if we can overcommit. In the normal flushing case (like with delalloc metadata reservations) we'll take the free space and divide it by 2 if our metadata profile is setup for DUP or any of those, and then divide it by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing case (we really need this reservation now!) we only limit ourselves to half of the free space. This makes this fio test [torrent] filename=torrent-test rw=randwrite size=4g ioengine=sync directory=/mnt/btrfs-test go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB file system. This doesn't seem to break my other enospc tests, but could really use some more testing as this is a super scary change. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
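The overcommit limit described above reduces to a couple of shifts. The sketch below is a model under stated assumptions, not the kernel code: the function and parameter names are hypothetical, and applying the DUP halving in the non-flushing case as well is an assumption, since the message only spells it out for the flushing case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the overcommit limit.
 *   unallocated_bytes:  chunk space not yet allocated on the devices
 *   duplicated_profile: metadata profile stores two copies (DUP or similar)
 *   can_flush:          normal flushing reservation vs. "need it now"
 */
uint64_t overcommit_limit(uint64_t unallocated_bytes,
                          bool duplicated_profile,
                          bool can_flush)
{
    uint64_t avail = unallocated_bytes;

    /* Two copies of metadata means each byte costs twice the raw space. */
    if (duplicated_profile)
        avail >>= 1;

    if (can_flush)
        avail >>= 3;   /* flushing case: only 1/8, to stay conservative */
    else
        avail >>= 1;   /* non-flushing case: up to half */

    return avail;
}
```

For 1024 unallocated bytes with a duplicated profile, the flushing case allows 64 bytes of overcommit and the non-flushing case 256.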
2011-10-19 | Btrfs: check unused against how much space we actually want | Josef Bacik
There is a bug that may lead to early ENOSPC in our reservation code. We've been checking against num_bytes which may be above and beyond what we want to actually reserve, which could give us a false ENOSPC. Fix this by making sure the unused space is above how much we want to reserve and not how much we're trying to flush. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: delay iput when deleting a block group | Josef Bacik
I kept getting warnings from evict because we were calling btrfs_start_transaction() with a transaction already started when doing a balance. This is because we remove a block group which requires a transaction, and then put the last reference on the cache inode. Instead of doing this we need to delay the iput so it is not done within a started transaction. This gets rid of our warnings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: stop passing a trans handle all around the reservation code | Josef Bacik
The only thing that we need to have a trans handle for is in reserve_metadata_bytes and that's to know how much flushing we can do. So instead of passing it around, just check current->journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: don't get the block_rsv in btrfs_free_tree_block | Josef Bacik
Since the durable block rsv stuff has been killed there is no need to get the block_rsv in btrfs_free_tree_block anymore. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: use the transactions block_rsv for the csum root | Josef Bacik
The alloc warnings everybody has been seeing are because we have been reserving space for csums, but we weren't actually using that space. So make get_block_rsv() return the trans->block_rsv if we're modifying the csum root. Also set the trans->block_rsv to NULL so that if we modify the csum root when running delayed refs that comes out of the global reserve like it's supposed to. With this patch I'm not seeing those alloc warnings anymore. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: handle enospc accounting for free space inodes | Josef Bacik
Since free space inodes now use normal checksumming we need to make sure to account for their metadata use. So reserve metadata space, and then if we fail to write out the metadata we can just release it, otherwise it will be freed up when the io completes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: don't increase the block_rsv's size when emergency allocating space | Josef Bacik
If we have to emergency reserve space we need to not increase the block_rsv size, otherwise we'll leak space. Take for instance delalloc, say we reserve 4k, and we use that 4k, and then we have to emergency allocate another 4k, we bump the size up to 8k, however we've only accounted for 4k in reservations in all of our supporting logic, so we'll go to free the 4k and end up having a size of 4k, which will cause us to later not free as much space. I saw this doing testing where I wasn't reserving enough space for something but was still leaking space, very frustrating. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
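The 4k example in this message can be replayed step by step. This is a sketch with a hypothetical two-field model of a block reserve; the struct, the function name, and the idea of toggling the old behaviour with a flag are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a block reserve: target size and bytes held. */
struct block_rsv_model {
    uint64_t size;
    uint64_t reserved;
};

/*
 * Replays the delalloc example and returns the size left in the reserve
 * once the one tracked 4k reservation has been released. A non-zero
 * result is the leak. bump_size_on_emergency selects the old behaviour.
 */
uint64_t leftover_size_after_scenario(bool bump_size_on_emergency)
{
    const uint64_t k4 = 4096;
    struct block_rsv_model rsv = { 0, 0 };

    /* Reserve 4k: both the target size and the held bytes grow. */
    rsv.size += k4;
    rsv.reserved += k4;

    /* Use that 4k. */
    rsv.reserved -= k4;

    /* Emergency allocation of another 4k, consumed immediately. */
    if (bump_size_on_emergency)
        rsv.size += k4;

    /* The supporting logic only knows about one 4k reservation. */
    rsv.size -= k4;

    return rsv.size;
}
```

With the old behaviour the function returns 4096, the stale size the message describes; without the bump it returns 0.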
2011-10-19 | Btrfs: fix space leak when we fail to make an allocation | Josef Bacik
When changing back to using a spin_lock to protect the extent counters I decided that since we would only be dropping our original extent, it was ok to just drop the extent and return. However since somebody else could have come in and done a reservation, we need to do the normal song and dance to clear the reservation out properly. So calculate how much space we need to free, and then subtract what we just attempted to reserve. If it's more, then we know we need to drop those bytes from the delalloc block rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check | Josef Bacik
If you run xfstests 224 you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in the truncate and evict cases we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: kill btrfs_truncate_reserve_metadata | Josef Bacik
Since we've optimized the truncate path, we no longer require this function. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: don't try to commit in btrfs_block_rsv_check | Josef Bacik
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot because we have a transaction open it will return EAGAIN, so we do not need to try and commit the transaction again. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: kill unused parts of block_rsv | Josef Bacik
The priority and refill_used flags are not used anymore, and neither is the usage counter, so just remove them from btrfs_block_rsv. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: kill the durable block rsv stuff | Josef Bacik
This is confusing code and isn't used by anything anymore, so delete it. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 | Btrfs: calculate checksum space correctly | Josef Bacik
We have not been reserving enough space for checksums. We were just reserving bytes for the checksum items themselves, we were not taking into account having to cow the tree and such. This patch adds a csum_bytes counter to the inode for keeping track of the number of bytes outstanding we have for checksums. Then we calculate how many leaves would be required for the checksums we are given and use that to reserve space. This adds a significant amount of bytes to our reservations, but we will handle this later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
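The leaf calculation can be sketched as two round-up divisions. All names and parameters below are hypothetical, and the model assumes one checksum per sector of data and a fixed usable size per leaf; the real item headers and tree overhead are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical estimate of how many checksum-tree leaves are needed
 * for data_bytes of outstanding data.
 *   sector_size:    bytes covered by one checksum
 *   csum_size:      bytes per checksum
 *   leaf_data_size: usable bytes per leaf for checksums
 * Returns 0 for degenerate parameters instead of dividing by zero.
 */
uint64_t csum_leaves_needed(uint64_t data_bytes,
                            uint32_t sector_size,
                            uint32_t csum_size,
                            uint32_t leaf_data_size)
{
    uint64_t csums_per_leaf;
    uint64_t num_csums;

    if (sector_size == 0 || csum_size == 0)
        return 0;

    csums_per_leaf = leaf_data_size / csum_size;
    if (csums_per_leaf == 0)
        return 0;

    /* Round up to whole sectors, then to whole leaves. */
    num_csums = (data_bytes + sector_size - 1) / sector_size;
    return (num_csums + csums_per_leaf - 1) / csums_per_leaf;
}
```

The reservation would then be sized per leaf rather than per checksum item, which is why the message notes that the reserved byte count grows significantly.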
2011-10-19 | Btrfs: use bytes_may_use for all ENOSPC reservations | Josef Bacik
We have been using bytes_reserved for metadata reservations, which is wrong since we use that to keep track of outstanding reservations from the allocator. This resulted in us doing a lot of silly things to make sure we don't allocate a bunch of metadata chunks since we never had a real view of how much space was actually in use by metadata. This passes Arne's enospc test and xfstests as well as my own enospc tests. Hopefully this will get us moving in the right direction. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: kill reserved_bytes in inodeJosef Bacik
reserved_bytes is not used for anything in the inode, remove it. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-09-29btrfs: btrfs_multi_bio replaced with btrfs_bioJan Schmidt
btrfs_bio is a bio abstraction able to split and not complete after the last bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio tracks the mirror_num used to read data which can be used for error correction purposes. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-08-21Btrfs: fix 64 bit divide problemJosef Bacik
This fixes a regression introduced by commit cdcb725c05fe ("Btrfs: check if there is enough space for balancing smarter"). We can't do 64-bit divides on 32-bit architectures. In cases where we need to divide/multiply by 2 we should just right/left shift respectively, and in cases where there are N devices we should use do_div. Also make the counters u64 to match up with rw_devices. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Acked-and-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
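As a rough userspace illustration of the two techniques: the kernel's do_div() divides a 64-bit value in place by a 32-bit divisor and returns the remainder. The sketch_do_div() function below is a hypothetical stand-in with the same contract (it uses the plain operators, which is fine outside the kernel), and the other helper names are likewise invented for the example.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's do_div() contract: divide *n in
 * place by a 32-bit base and return the remainder. Hypothetical name. */
static uint32_t sketch_do_div(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* Divide by 2 with a right shift instead of a 64-bit '/'. */
static uint64_t half(uint64_t v)
{
    return v >> 1;
}

/* Multiply by 2 with a left shift instead of a 64-bit '*'. */
static uint64_t twice(uint64_t v)
{
    return v << 1;
}

/* Split a 64-bit byte count across N devices; the quotient is left in
 * bytes and the remainder is discarded. */
static uint64_t per_device(uint64_t bytes, uint32_t num_devices)
{
    sketch_do_div(&bytes, num_devices);
    return bytes;
}
```

In kernel code the call would be do_div(bytes, num_devices) on a u64 lvalue; the shift helpers need no special support on 32-bit targets.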
2011-08-16Btrfs: forced readonly when btrfs_drop_snapshot() failsTsutomu Itoh
The filesystem turns readonly instead of returning the error to the caller when an error is detected in btrfs_drop_snapshot(). And because the caller doesn't check the error, the function's return type is changed to 'void'. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16Btrfs: check if there is enough space for balancing smarterliubo
When checking if there is enough space for balancing a block group, since we do not take raid types into consideration, we do not account for the correct amount of space that we need. This makes us do some extra work before we get ENOSPC. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16Btrfs: detect wether a device supports discardJosef Bacik
We have a problem where if a user specifies discard but the device doesn't actually support it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem because this gets called (in a fashion) from the tree log recovery code, which has a nice little BUG_ON(ret) after it, which causes us to fail the tree log replay. So instead detect whether our devices support discard when we're adding them and then don't issue discards if we know that the device doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard just in case we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
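The two safeguards described here (skip devices known not to support discard, and swallow a late EOPNOTSUPP) can be sketched as follows. Everything in this block is a hypothetical userspace model: the struct, the can_discard flag name, and the callback are assumptions made for illustration, not the kernel's actual definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model: can_discard would be probed once, when
 * the device is added. */
struct sketch_device {
    bool can_discard;
    int (*discard)(uint64_t start, uint64_t len);
};

/* Example backends used to exercise the sketch. */
static int discard_unsupported(uint64_t start, uint64_t len)
{
    (void)start;
    (void)len;
    return -EOPNOTSUPP;
}

static int discard_io_error(uint64_t start, uint64_t len)
{
    (void)start;
    (void)len;
    return -EIO;
}

static int sketch_issue_discard(const struct sketch_device *dev,
                                uint64_t start, uint64_t len)
{
    int ret;

    /* Known not to support discard: do nothing and report success. */
    if (!dev->can_discard)
        return 0;

    ret = dev->discard(start, len);

    /* For good measure, never let EOPNOTSUPP reach a BUG_ON(ret). */
    if (ret == -EOPNOTSUPP)
        ret = 0;
    return ret;
}
```

Real errors such as -EIO still propagate; only the "not supported" case is treated as success, so a caller that asserts on the return value is not tripped by a missing optional feature.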
2011-08-01Btrfs: don't print the leaf if we had an errorJosef Bacik
In __btrfs_free_extent we will print the leaf if we fail to find the extent we wanted, but the problem is if we get an error we won't have a leaf so often this leads to a NULL pointer dereference and we lose the error that actually occurred. So only print the leaf if ret > 0, which means we didn't find the item we were looking for but we didn't error either. This way the error is preserved. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
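The return-value convention this fix relies on is: negative means an error (no valid leaf), zero means the item was found, and positive means the search completed without finding the item. A minimal sketch, using hypothetical helper names:

```c
#include <stdbool.h>

/* Hypothetical helper: the leaf is only safe to print when the lookup
 * finished without error but did not find the item (ret > 0). */
static bool should_print_leaf(int ret)
{
    return ret > 0;
}

/* Hypothetical helper describing what the caller should do. */
static const char *describe_lookup(int ret)
{
    if (ret < 0)
        return "error: no leaf available, propagate ret";
    if (ret > 0)
        return "not found: leaf is valid, print it for debugging";
    return "found";
}
```

Guarding the debug print this way avoids dereferencing a leaf that was never set up and keeps the original error code intact for the caller.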
2011-08-01Btrfs: fix oops while writing data to SSD partitionsliubo
Here I have a btrfs on two SSD partitions, set by default to "data=raid0, metadata=raid1". I then try to fill my btrfs partition until "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp". I get an oops from kernel BUG at fs/btrfs/extent-tree.c:5199!, which refers to find_free_extent's BUG_ON(index != get_block_group_index(block_group)); In SSD mode, in order to find enough space to alloc, we may check a block_group cache that was already checked some time before, but the index is not updated, so we hit the BUG_ON. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Acked-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>