summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2011-04-08xfs: push the AIL from memory reclaim and periodic syncDave Chinner
When we are short on memory, we want to expedite the cleaning of dirty objects. Hence when we run short on memory, we need to kick the AIL flushing into action to clean as many dirty objects as quickly as possible. To implement this, sample the lsn of the log item at the head of the AIL and use that as the push target for the AIL flush. Further, we keep items in the AIL that are dirty that are not tracked any other way, so we can get objects sitting in the AIL that don't get written back until the AIL is pushed. Hence to get the filesystem to the idle state, we might need to push the AIL to flush out any remaining dirty objects sitting in the AIL. This requires the same push mechanism as the reclaim push. This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to match the new xfs_ail_max_lsn() function introduced in this patch. Similarly for xfs_trans_ail_push -> xfs_ail_push. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: clean up code layout in xfs_trans_ail.cDave Chinner
This patch rearranges the location of functions in xfs_trans_ail.c to remove the need for forward declarations of those functions in preparation for adding new functions without the need for forward declarations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: convert the xfsaild threads to a workqueueDave Chinner
Similar to the xfssyncd, the per-filesystem xfsaild threads can be converted to a global workqueue and run periodically by delayed works. This makes sense for the AIL pushing because it uses variable timeouts depending on the work that needs to be done. By removing the xfsaild, we simplify the AIL pushing code and remove the need to spread the code to implement the threading and pushing across multiple files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: introduce background inode reclaim workDave Chinner
Background inode reclaim needs to run more frequently that the XFS syncd work is run as 30s is too long between optimal reclaim runs. Add a new periodic work item to the xfs syncd workqueue to run a fast, non-blocking inode reclaim scan. Background inode reclaim is kicked by the act of marking inodes for reclaim. When an AG is first marked as having reclaimable inodes, the background reclaim work is kicked. It will continue to run periodically untill it detects that there are no more reclaimable inodes. It will be kicked again when the first inode is queued for reclaim. To ensure shrinker based inode reclaim throttles to the inode cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the background inode reclaim so that when we are low on memory we are trying to reclaim inodes as efficiently as possible. This kick shoul d not be necessary, but it will protect against failures to kick the background reclaim when inodes are first dirtied. To provide the rate throttling, make the shrinker pass do synchronous inode reclaim so that it blocks on inodes under IO. This means that the shrinker will reclaim inodes rather than just skipping over them, but it does not adversely affect the rate of reclaim because most dirty inodes are already under IO due to the background reclaim work the shrinker kicked. These two modifications solve one of the two OOM killer invocations Chris Mason reported recently when running a stress testing script. The particular workload trigger for the OOM killer invocation is where there are more threads than CPUs all unlinking files in an extremely memory constrained environment. Unlike other solutions, this one does not have a performance impact on performance when memory is not constrained or the number of concurrent threads operating is <= to the number of CPUs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: convert ENOSPC inode flushing to use new syncd workqueueDave Chinner
On of the problems with the current inode flush at ENOSPC is that we queue a flush per ENOSPC event, regardless of how many are already queued. Thi can result in hundreds of queued flushes, most of which simply burn CPU scanned and do no real work. This simply slows down allocation at ENOSPC. We really only need one active flush at a time, and we can easily implement that via the new xfs_syncd_wq. All we need to do is queue a flush if one is not already active, then block waiting for the currently active flush to complete. The result is that we only ever have a single ENOSPC inode flush active at a time and this greatly reduces the overhead of ENOSPC processing. On my 2p test machine, this results in tests exercising ENOSPC conditions running significantly faster - 042 halves execution time, 083 drops from 60s to 5s, etc - while not introducing test regressions. This allows us to remove the old xfssyncd threads and infrastructure as they are no longer used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: introduce a xfssyncd workqueueDave Chinner
All of the work xfssyncd does is background functionality. There is no need for a thread per filesystem to do this work - it can al be managed by a global workqueue now they manage concurrency effectively. Introduce a new gglobal xfssyncd workqueue, and convert the periodic work to use this new functionality. To do this, use a delayed work construct to schedule the next running of the periodic sync work for the filesystem. When the sync work is complete, queue a new delayed work for the next running of the sync work. For laptop mode, we wait on completion for the sync works, so ensure that the sync work queuing interface can flush and wait for work to complete to enable the work queue infrastructure to replace the current sequence number and wakeup that is used. Because the sync work does non-trivial amounts of work, mark the new work queue as CPU intensive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08xfs: fix extent format buffer allocation sizeDave Chinner
When formatting an inode item, we have to allocate a separate buffer to hold extents when there are delayed allocation extents on the inode and it is in extent format. The allocation size is derived from the in-core data fork representation, which accounts for delayed allocation extents, while the on-disk representation does not contain any delalloc extents. As a result of this mismatch, the allocated buffer can be far larger than needed to hold the real extent list which, due to the fact the inode is in extent format, is limited to the size of the literal area of the inode. However, we can have thousands of delalloc extents, resulting in an allocation size orders of magnitude larger than is needed to hold all the real extents. Fix this by limiting the size of the buffer being allocated to the size of the literal area of the inodes in the filesystem (i.e. the maximum size an inode fork can grow to). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2011-03-30xfs: fix unreferenced var error in xfs_buf.cDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2011-03-29fs: don't use igrab() while holding i_lockDave Chinner
Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥ If we are already holding the i_lock, we have a reference to the inode so we can safely use ihold() to gain an extra reference. This avoids hangs due to lock recursion on the i_lock now that the inode_lock is gone and igrab() uses the i_lock itself. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: Ryan Mallon <ryan@bluewatersys.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-28Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop using the page cache to back the buffer cache xfs: register the inode cache shrinker before quotachecks xfs: xfs_trans_read_buf() should return an error on failure xfs: introduce inode cluster buffer trylocks for xfs_iflush vmap: flush vmap aliases when mapping fails xfs: preallocation transactions do not need to be synchronous Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.
2011-03-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: write lock requested keys eCryptfs: move ecryptfs_find_auth_tok_for_sig() call before mutex_lock eCryptfs: verify authentication tokens before their use eCryptfs: modified size of keysig in the ecryptfs_key_sig structure eCryptfs: removed num_global_auth_toks from ecryptfs_mount_crypt_stat eCryptfs: ecryptfs_keyring_auth_tok_for_sig() bug fix eCryptfs: Unlock page in write_begin error path ecryptfs: modify write path to encrypt page in writepage eCryptfs: Remove ECRYPTFS_NEW_FILE crypt stat flag eCryptfs: Remove unnecessary grow_file() function
2011-03-28Merge branch 'for-linus-unmerged' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits) Btrfs: fix __btrfs_map_block on 32 bit machines btrfs: fix possible deadlock by clearing __GFP_FS flag btrfs: check link counter overflow in link(2) btrfs: don't mess with i_nlink of unlocked inode in rename() Btrfs: check return value of btrfs_alloc_path() Btrfs: fix OOPS of empty filesystem after balance Btrfs: fix memory leak of empty filesystem after balance Btrfs: fix return value of setflags ioctl Btrfs: fix uncheck memory allocations btrfs: make inode ref log recovery faster Btrfs: add btrfs_trim_fs() to handle FITRIM Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP Btrfs: make update_reserved_bytes() public btrfs: return EXDEV when linking from different subvolumes Btrfs: Per file/directory controls for COW and compression Btrfs: add datacow flag in inode flag btrfs: use GFP_NOFS instead of GFP_KERNEL Btrfs: check return value of read_tree_block() btrfs: properly access unaligned checksum buffer ... Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in the block layer.
2011-03-28Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits) Treat writes as new when holes span across page boundaries fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS. ocfs2/dlm: Move kmalloc() outside the spinlock ocfs2: Make the left masklogs compat. ocfs2: Remove masklog ML_AIO. ocfs2: Remove masklog ML_UPTODATE. ocfs2: Remove masklog ML_BH_IO. ocfs2: Remove masklog ML_JOURNAL. ocfs2: Remove masklog ML_EXPORT. ocfs2: Remove masklog ML_DCACHE. ocfs2: Remove masklog ML_NAMEI. ocfs2: Remove mlog(0) from fs/ocfs2/dir.c ocfs2: remove NAMEI from symlink.c ocfs2: Remove masklog ML_QUOTA. ocfs2: Remove mlog(0) from quota_local.c. ocfs2: Remove masklog ML_RESERVATIONS. ocfs2: Remove masklog ML_XATTR. ocfs2: Remove masklog ML_SUPER. ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c ... Fix up trivial conflict in fs/ocfs2/super.c
2011-03-28Treat writes as new when holes span across page boundariesGoldwyn Rodrigues
When a hole spans across page boundaries, the next write forces a read of the block. This could end up reading existing garbage data from the disk in ocfs2_map_page_blocks. This leads to non-zero holes. In order to avoid this, mark the writes as new when the holes span across page boundaries. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: jlbec <jlbec@evilplan.org>
2011-03-28Merge branch 'mlog_replace_for_39' of git://repo.or.cz/taoma-kernel into ↵Joel Becker
ocfs2-merge-window-fix
2011-03-28fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS.Rakib Mullick
When CONFIG_DEBUG_FS=y and CONFIG_OCFS2_FS_STATS=n, we get the following warning: fs/ocfs2/cluster/tcp.c:213:16: warning: ‘o2net_get_func_run_time’ defined but not used Since o2net_get_func_run_time is only called from o2net_update_recv_stats, so move it under CONFIG_OCFS2_FS_STATS. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: jlbec <jlbec@evilplan.org>
2011-03-28Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Ensure that rpc_release_resources_task() can be called twice. NFS: Don't leak RPC clients in NFSv4 secinfo negotiation NFS: Fix a hang in the writeback path
2011-03-28Btrfs: fix __btrfs_map_block on 32 bit machinesChris Mason
Recent changes for discard support didn't compile, this fixes them not to try and % 64 bit numbers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: fix possible deadlock by clearing __GFP_FS flagMiao Xie
Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause deadlock. Task1 open() ... btrfs_search_slot() ... btrfs_cow_block() ... alloc_page() wait for reclaiming shrink_slab() ... shrink_icache_memory() ... btrfs_evict_inode() ... btrfs_search_slot() If the path is locked by task1, the deadlock happens. So the btree's page cache is different with the file's page cache, it can not allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in GFP_HIGHUSER_MOVABLE flag. Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: check link counter overflow in link(2)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: don't mess with i_nlink of unlocked inode in rename()Al Viro
old_inode is not locked; it's not safe to play with its link count. Instead of bumping it and calling btrfs_unlink_inode(), add a variant of the latter that does not do btrfs_drop_nlink()/ btrfs_update_inode(), call it instead of btrfs_inc_nlink()/ btrfs_unlink_inode() and do btrfs_update_inode() ourselves. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: check return value of btrfs_alloc_path()Tsutomu Itoh
Adding the check on the return value of btrfs_alloc_path() to several places. And, some of callers are modified by this change. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: fix OOPS of empty filesystem after balanceliubo
btrfs will remove unused block groups after balance. When a empty filesystem is balanced, the block group with tag "DATA" may be dropped, and after umount and mount again, it will not find "DATA" space_info and lead to OOPS. So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS. Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: fix memory leak of empty filesystem after balanceliubo
After Josef's patch(commit 3c14874acc71180553fb5aba528e3cf57c5b958b), btrfs will exclude super bytes when reading block groups(by marking a extent state UPTODATE). However, these bytes do not get freed while balance remove unused block groups, and we won't process those removed ones any more, when we do umount and unload the btrfs module, btrfs hits a memory leak. This patch add the missing free operation. Reproduce steps: $ mkfs.btrfs disk $ mount disk /mnt/btrfs -o loop $ btrfs filesystem balance /mnt/btrfs $ umount /mnt/btrfs $ rmmod btrfs Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: fix return value of setflags ioctlliubo
setflags ioctl should return error when any checks fail. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: fix uncheck memory allocationsYoshinori Sano
To make Btrfs code more robust, several return value checks where memory allocation can fail are introduced. I use BUG_ON where I don't know how to handle the error properly, which increases the number of using the notorious BUG_ON, though. Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: make inode ref log recovery fasterliubo
When we recover from crash via write-ahead log tree and process the inode refs, for each btrfs_inode_ref item, we will 1) check if we already have a perfect match in fs/file tree, if we have, then we're done. 2) search the corresponding back reference in fs/file tree, and check all the names in this back reference to see if they are also in the log to avoid conflict corners. 3) recover the logged inode refs to fs/file tree. In current btrfs, however, - for 2)'s check, once is enough, since the checked back reference will remain unchanged after processing all the inode refs belonged to the key. - it has no need to do another 1) between 2) and 3). I've made a small test to show how it improves, $dd if=/dev/zero of=foobar bs=4K count=1 $sync $make 100 hard links continuously, like ln foobar link_i $fsync foobar $echo b > /proc/sysrq-trigger after reboot $time mount DEV PATH without patch: real 0m0.285s user 0m0.001s sys 0m0.009s with patch: real 0m0.123s user 0m0.000s sys 0m0.010s Changelog v1->v2: - fix double free - pointed by David Sterba Changelog v2->v3: - adjust free order Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: add btrfs_trim_fs() to handle FITRIMLi Dongyang
We take an free extent out from allocator, trim it, then put it back, but before we trim the block group, we should make sure the block group is cached, so plus a little change to make cache_block_group() run without a transaction. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytesLi Dongyang
Callers of btrfs_discard_extent() should check if we are mounted with -o discard, as we want to make fitrim to work even the fs is not mounted with -o discard. Also we should use REQ_DISCARD to map the free extent to get a full mapping, last we only return errors if 1. the error is not a EOPNOTSUPP 2. no device supports discard Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: make btrfs_map_block() return entire free extent for each device of ↵Li Dongyang
RAID0/1/10/DUP btrfs_map_block() will only return a single stripe length, but we want the full extent be mapped to each disk when we are trimming the extent, so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: make update_reserved_bytes() publicLi Dongyang
Make the function public as we should update the reserved extents calculations after taking out an extent for trimming. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: return EXDEV when linking from different subvolumesMark Fasheh
btrfs_link returns EPERM if a cross-subvolume link is attempted. However, in this case I believe EXDEV to be the more appropriate value. >From the link(2) man page: EXDEV oldpath and newpath are not on the same mounted file system. (Linux permits a file system to be mounted at multiple points, but link() does not work across different mount points, even if the same file system is mounted on both.) This matters because an application may have different behaviors based on return codes. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: Per file/directory controls for COW and compressionLiu Bo
Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. According to Chris's comment, there should be just one true compression method(probably LZO) stored in the super. However, before this, we would wait for that one method is stable enough to be adopted into the super. So I list it as a long term goal, and just store it in ram today. After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to control file and directory's datacow and compression attribute. NOTE: - The compression type is selected by such rules: If we mount btrfs with compress options, ie, zlib/lzo, the type is it. Otherwise, we'll use the default compress type (zlib today). v1->v2: - rebase to the latest btrfs. v2->v3: - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW will be screwed by inheritance from parent directory. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: use GFP_NOFS instead of GFP_KERNELMiao Xie
In the filesystem context, we must allocate memory by GFP_NOFS, or we may start another filesystem operation and make kswap thread hang up. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: check return value of read_tree_block()Tsutomu Itoh
This patch is checking return value of read_tree_block(), and if it is NULL, error processing. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28btrfs: properly access unaligned checksum bufferDavid Sterba
On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote: > Thanks for fielding this one. Does put_unaligned_le32 optimize away on > platforms with efficient access? It would be great if we didn't need > the #ifdef. (quicktest: assembly output is same for put_unaligned_le32 and direct assignment on my x86_64) I was originally following examples in Documentation/unaligned-memory-access.txt. From other code it seems to me that the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for larger portions of code. Macros/wrappers for {put,get}_unaligned* are chosen via arch/<arch>/include/asm/unaligned.h accordingly, therefore it's safe to use put_unaligned_le32 without the ifdef. dave Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: cleanup some BUG_ON()Tsutomu Itoh
This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: add initial tracepoint support for btrfsliubo
Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28Btrfs: use RCU instead of a spinlock to protect the root nodeChris Mason
The pointer to the extent buffer for the root of each tree is protected by a spinlock so that we can safely read the pointer and take a reference on the extent buffer. But now that the extent buffers are freed via RCU, we can safely use rcu_read_lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28eCryptfs: write lock requested keysRoberto Sassu
A requested key is write locked in order to prevent modifications on the authentication token while it is being used. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: move ecryptfs_find_auth_tok_for_sig() call before mutex_lockRoberto Sassu
The ecryptfs_find_auth_tok_for_sig() call is moved before the mutex_lock(s->tfm_mutex) instruction in order to avoid possible deadlocks that may occur by holding the lock on the two semaphores 'key->sem' and 's->tfm_mutex' in reverse order. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: verify authentication tokens before their useRoberto Sassu
Authentication tokens content may change if another requestor calls the update() method of the corresponding key. The new function ecryptfs_verify_auth_tok_from_key() retrieves the authentication token from the provided key and verifies if it is still valid before being used to encrypt or decrypt an eCryptfs file. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> [tyhicks: Minor formatting changes] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: modified size of keysig in the ecryptfs_key_sig structureRoberto Sassu
The size of the 'keysig' array is incremented of one byte in order to make room for the NULL character. The 'keysig' variable is used, in the function ecryptfs_generate_key_packet_set(), to find an authentication token with the given signature and is printed a debug message if it cannot be retrieved. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: removed num_global_auth_toks from ecryptfs_mount_crypt_statRoberto Sassu
This patch removes the 'num_global_auth_toks' field of the ecryptfs_mount_crypt_stat structure, used to count the number of items in the 'global_auth_tok_list' list. This variable is not needed because there are no checks based upon it. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: ecryptfs_keyring_auth_tok_for_sig() bug fixRoberto Sassu
The pointer '(*auth_tok_key)' is set to NULL in case request_key() fails, in order to prevent its use by functions calling ecryptfs_keyring_auth_tok_for_sig(). Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: Unlock page in write_begin error pathTyler Hicks
Unlock the page in error path of ecryptfs_write_begin(). This may happen, for example, if decryption fails while bring the page up-to-date. Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28ecryptfs: modify write path to encrypt page in writepageThieu Le
Change the write path to encrypt the data only when the page is written to disk in ecryptfs_writepage. Previously, ecryptfs encrypts the page in ecryptfs_write_end which means that if there are multiple write requests to the same page, ecryptfs ends up re-encrypting that page over and over again. This patch minimizes the number of encryptions needed. Signed-off-by: Thieu Le <thieule@chromium.org> [tyhicks: Changed NULL .drop_inode sop pointer to generic_drop_inode] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: Remove ECRYPTFS_NEW_FILE crypt stat flagTyler Hicks
Now that grow_file() is not called in the ecryptfs_create() path, the ECRYPTFS_NEW_FILE flag is no longer needed. It helped ecryptfs_readpage() know not to decrypt zeroes that were read from the lower file in the grow_file() path. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-28eCryptfs: Remove unnecessary grow_file() functionTyler Hicks
When creating a new eCryptfs file, the crypto metadata is written out and then the lower file was being "grown" with 4 kB of encrypted zeroes. I suspect that growing the encrypted file was to prevent an information leak that the unencrypted file was empty. However, the unencrypted file size is stored, in plaintext, in the metadata so growing the file is unnecessary. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-03-27Merge branch 'for-linus-1' of git://git.infradead.org/mtd-2.6Linus Torvalds
* 'for-linus-1' of git://git.infradead.org/mtd-2.6: (49 commits) mtd: mtdswap: fix compilation warning mtdswap: kill strict error handling option mtd: nand: enable software BCH ECC in nand simulator mtd: nand: add software BCH ECC support mtd: fix printf format warnings, mostly lack of %zd for size_t, in mtdswap mtd: sm_rtl: check kmalloc return value mtd: cfi: add support for AMIC flashes (e.g. A29L160AT) lib: add shared BCH ECC library mtd: mxc_nand: fix OOB corruption when page size > 2KiB mtd: DaVinci: Removed header file that is not required mtd: pxa3xx_nand: clean the keep configure code mtd: pxa3xx_nand: mtd scan id process could be defined by driver itself mtd: pxa3xx_nand: unify prepare command mtd: pxa3xx_nand: discard wait_for_event,write_cmd,__readid function mtd: pxa3xx_nand: rework irq logic mtd: pxa3xx_nand: make scan procedure more clear mtd: speedtest: fix integer overflow mtd: mxc_nand: fix read past buffer end mtd: omap3: nand: report corrected ecc errors jffs2: remove a trailing white space in commentaries ...