path: root/fs
Age  Commit message  Author
2011-05-25  Merge branch 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block  (Linus Torvalds)
* 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block:
  splice: add wakeup_pipe_readers()
2011-05-25  Merge branch 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block  (Linus Torvalds)
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
  cfq-iosched: free cic_index if cfqd allocation fails
  cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
  cfq-iosched: reduce bit operations in cfq_choose_req()
  cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
  blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
  block: move bd_set_size() above rescan_partitions() in __blkdev_get()
  block: call elv_bio_merged() when merged
  cfq-iosched: Make IO merge related stats per cpu
  cfq-iosched: Fix a memory leak of per cpu stats for root group
  backing-dev: Kill set but not used var in bdi_debug_stats_show()
  block: get rid of on-stack plugging debug checks
  blk-throttle: Make no throttling rule group processing lockless
  blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
  blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
  blk-throttle: Make dispatch stats per cpu
  blk-throttle: Free up a group only after one rcu grace period
  blk-throttle: Use helper function to add root throtl group to lists
  blk-throttle: Introduce a helper function to fill in device details
  blk-throttle: Dynamically allocate root group
  blk-cgroup: Allow sleeping while dynamically allocating a group
  ...
2011-05-25  xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent  (Christoph Hellwig)
The code in xfs_bmap_del_extent does not correctly decrement the extent buffer index when deleting a whole extent. Most of the time this gets caught by checks in xfs_bmapi that work around it and decrement it manually, and thus it wasn't noticed so far.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec  (Christoph Hellwig)
Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: fix up asserts in xfs_iflush_fork  (Christoph Hellwig)
Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext with a potentially invalid extent buffer index.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: do not do pointer arithmetic on extent records  (Christoph Hellwig)
We need to call xfs_iext_get_ext for the previous extent to get a valid pointer, and can't just do pointer arithmetic, as the records might be in different pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
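[A minimal sketch of the kind of change this describes, with assumed variable names (ep, ifp, idx), not the literal hunk: arithmetic on an extent record pointer is replaced by a fresh lookup through xfs_iext_get_ext, which knows about the paged indirection array.]

        /* Broken: the previous record may live in a different page */
        prev = *(ep - 1);

        /* Fixed: always map the index through the lookup helper */
        prev = *xfs_iext_get_ext(ifp, idx - 1);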
2011-05-25  xfs: do not use unchecked extent indices in xfs_bunmapi  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index when moving on to the next index in xfs_bunmapi. Also remove the old workaround for too-large indices that has been superseded by the proper fix in xfs_bmap_del_extent.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
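[A hedged sketch of the validate-before-dereference pattern the series establishes; the bound check is an assumption modeled on the in-core extent list, not the literal patch.]

        /* Advance to the next extent only if the index is still in range */
        if (++lastx < ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t))
                ep = xfs_iext_get_ext(ifp, lastx);
        else
                ep = NULL;      /* no further extents; callers must check */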
2011-05-25  xfs: do not use unchecked extent indices in xfs_bmapi  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index when moving on to the next index in xfs_bmapi.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*  (Christoph Hellwig)
Make sure to only call xfs_iext_get_ext after we've validated the extent index in the various xfs_bmap_add_extent_* helpers.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  xfs: remove if_lastex  (Christoph Hellwig)
The if_lastex field in struct xfs_ifork is only used as a temporary index during xfs_bmapi and xfs_bunmapi. Instead of using the inode fork to store it, keep it local in the callchain. Fortunately this is very easy, as we already pass a stack copy of it down the whole chain, which can simply be changed to be passed by reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
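[Roughly, the conversion looks like this; argument lists are abbreviated and assumed, so treat it as a sketch rather than the actual diff.]

        /* Before: the index round-trips through the xfs_ifork */
        ifp->if_lastex = lastx;
        error = xfs_bmap_add_extent(ip, lastx, &cur, new, &logflags, whichfork);

        /* After: callers keep it on their stack and pass it by reference */
        error = xfs_bmap_add_extent(ip, &lastx, &cur, new, &logflags, whichfork);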
2011-05-25  xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag  (Christoph Hellwig)
The XFS_BMAPI_RSVBLOCKS flag is unused, and as far as I can see always has been. Remove it to simplify the bmapi implementation and conserve stack space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25  fs/ncpfs/inode.c: suppress used-uninitialised warning  (Andrew Morton)
We get this spurious warning:

  fs/ncpfs/inode.c: In function 'ncp_fill_super':
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
  ...

It's not a bug, but we can easily fix it with a memset().

Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  fscache: remove dead code under CONFIG_WORKQUEUE_DEBUGFS  (Amerigo Wang)
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  proc: allocate storage for numa_maps statistics once  (Stephen Wilson)
In show_numa_map() we collect statistics into a numa_maps structure. Since the number of NUMA nodes can be very large, this structure is not a candidate for stack allocation.

Instead of going through a kmalloc()+kfree() cycle each time show_numa_map() is invoked, perform the allocation just once when /proc/pid/numa_maps is opened.

Performing the allocation when numa_maps is opened, and thus before a reference to the target task's mm is taken, eliminates a potential stalemate condition in the oom-killer as originally described by Hugh Dickins:

  ... imagine what happens if the system is out of memory, and the mm
  we're looking at is selected for killing by the OOM killer: while we
  wait in __get_free_page for more memory, no memory is freed from the
  selected mm because it cannot reach exit_mmap while we hold that
  reference.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
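[The pattern here is the standard seq_file one: pay for the allocation in open() and free it in release(). A minimal sketch, assuming a numa_maps_private wrapper struct and the proc_pid_numa_maps_op seq_operations name; both are illustrative, not quoted from the patch.]

        static int numa_maps_open(struct inode *inode, struct file *file)
        {
                struct numa_maps_private *priv;

                /* One allocation per open(); show_numa_map() reuses it */
                priv = __seq_open_private(file, &proc_pid_numa_maps_op,
                                          sizeof(*priv));
                if (!priv)
                        return -ENOMEM;
                priv->proc_maps.pid = proc_pid(inode);
                return 0;
        }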
2011-05-25  proc: make struct proc_maps_private truly private  (Stephen Wilson)
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps there is no need to export struct proc_maps_private to the world. Move it to fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: proc: move show_numa_map() to fs/proc/task_mmu.c  (Stephen Wilson)
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several issues.

- Having the show() operation "miles away" from the corresponding seq_file iteration operations is a maintenance burden.
- The need to export ad hoc info like struct proc_maps_private is eliminated.
- The implementation of show_numa_map() can be improved in a simple manner by cooperating with the other seq_file operations (start, stop, etc) -- something that would be messy to do without this change.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  tmpfs: implement generic xattr support  (Eric Paris)
Implement generic xattrs for tmpfs filesystems. The Fedora project, while trying to replace suid apps with file capabilities, realized that tmpfs, which is used on the build systems, does not support file capabilities and thus cannot be used to build packages which use file capabilities. Xattrs are also needed for overlayfs.

The xattr interface is a bit odd. If a filesystem does not implement any {get,set,list}xattr functions, the VFS will call into some random LSM hooks and the running LSM can then implement some method for handling xattrs. SELinux for example provides a method to support security.selinux but no other security.* xattrs.

As it stands today, when one enables CONFIG_TMPFS_POSIX_ACL, tmpfs will have xattr handler routines specifically to handle acls. Because of this, tmpfs would lose the VFS/LSM helpers to support the running LSM. To make up for that, tmpfs had stub functions that did nothing but call into the LSM hooks which implement the helpers.

This new patch does not use the LSM fallback functions and instead just implements a native get/set/list xattr feature for the full security.* and trusted.* namespace like a normal filesystem. This means that tmpfs can now support both security.selinux and security.capability, which was not previously possible.

The basic implementation is that I attach a:

  struct shmem_xattr {
          struct list_head list;  /* anchored by shmem_inode_info->xattr_list */
          char *name;
          size_t size;
          char value[0];
  };

into the struct shmem_inode_info for each xattr that is set.

This implementation could easily support the user.* namespace as well, except some care needs to be taken to prevent large amounts of unswappable memory being allocated for unprivileged users.

[mszeredi@suse.cz: new config option, support trusted.*, support symlinks]
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
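[With the xattrs on a simple per-inode list, lookup is just a linear scan; a minimal sketch of the get-side helper, with an assumed helper name, locking elided.]

        static struct shmem_xattr *shmem_xattr_find(struct shmem_inode_info *info,
                                                    const char *name)
        {
                struct shmem_xattr *xattr;

                /* list anchored at shmem_inode_info->xattr_list */
                list_for_each_entry(xattr, &info->xattr_list, list) {
                        if (!strcmp(name, xattr->name))
                                return xattr;
                }
                return NULL;
        }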
2011-05-25  vmscan: change shrinker API by passing shrink_control struct  (Ying Han)
Change each shrinker's API by consolidating the existing parameters into a shrink_control struct. This will simplify any further feature additions without having to touch every shrinker file.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
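[After the conversion a shrinker receives everything through one struct; a sketch of the consolidated API from this series. The my_cache_* callbacks are hypothetical stand-ins for a filesystem's own cache accounting.]

        struct shrink_control {
                gfp_t gfp_mask;
                unsigned long nr_to_scan;  /* objects to scan; 0 = just count */
        };

        static int my_cache_shrink(struct shrinker *shrink,
                                   struct shrink_control *sc)
        {
                if (!sc->nr_to_scan)
                        return my_cache_count();        /* report cache size */
                return my_cache_free(sc->nr_to_scan, sc->gfp_mask);
        }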
2011-05-25  vmscan: change shrink_slab() interfaces by passing shrink_control  (Ying Han)
Consolidate the existing parameters to shrink_slab() into a new shrink_control struct. This is needed later to pass the same struct to shrinkers.

Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: Convert i_mmap_lock to a mutex  (Peter Zijlstra)
Straightforward conversion of i_mmap_lock to a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
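[One representative call site before and after, as a sketch; the series touches many such sites, and the field is renamed i_mmap_mutex along the way.]

        /* Before: a spinlock, so nothing under it may sleep */
        spin_lock(&mapping->i_mmap_lock);
        __remove_shared_vm_struct(vma, file, mapping);
        spin_unlock(&mapping->i_mmap_lock);

        /* After: a mutex, so the protected sections may sleep */
        mutex_lock(&mapping->i_mmap_mutex);
        __remove_shared_vm_struct(vma, file, mapping);
        mutex_unlock(&mapping->i_mmap_mutex);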
2011-05-25  mm: Remove i_mmap_lock lockbreak  (Peter Zijlstra)
Hugh says:

  "The only significant loser, I think, would be page reclaim (when
   concurrent with truncation): could spin for a long time waiting for
   the i_mmap_mutex it expects would soon be dropped?"

Counter points:

- cpu contention makes the spin stop (need_resched())
- zap pages should be freeing pages at a higher rate than reclaim ever can

I think the simplification of the truncate code is definitely worth it.

Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25  mm: mmu_gather rework  (Peter Zijlstra)
Rework the existing mmu_gather infrastructure.

The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo.

The first 9 patches rework the mmu_gather infrastructure. For review purposes I've split them into generic and per-arch patches, with the last of those a generic cleanup.

The next patch provides generic RCU page-table freeing, and the follow-up is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that use this to implement gup_fast() for sparc64.

Then there is one patch that extends the generic mmu_gather batching.

After that follow the mm preemptibility patches; these make part of the mm a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to mutexes, which together with the mmu_gather rework makes mmu_gather preemptible as well.

Making i_mmap_lock a mutex also enables a clean-up of the truncate code. This also allows for preemptible mmu_notifiers, something that XPMEM I think wants. Furthermore, it removes the new and universally detested unmap_mutex.

This patch:

Remove the first obstacle towards a fully preemptible mmu_gather. The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and, in case of failure, use a small on-stack array to make some progress.

Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock.

Also avoid flushing the tlb batches from under the pte lock; this is useful even without the i_mmap_lock conversion as it significantly reduces pte lock hold times.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
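[The shape of the batching fallback described above, as a sketch modeled on the generic mmu_gather; field layout and the bundle size are assumptions, and per-arch details differ.]

        #define MMU_GATHER_BUNDLE 8     /* small on-stack fallback batch */

        struct mmu_gather_batch {
                struct mmu_gather_batch *next;
                unsigned int            nr;
                unsigned int            max;
                struct page             *pages[0];
        };

        struct mmu_gather {
                struct mm_struct        *mm;
                struct mmu_gather_batch *active;
                struct mmu_gather_batch local;  /* used if page alloc fails */
                struct page             *__pages[MMU_GATHER_BUNDLE];
        };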
2011-05-25  mm: make expand_downwards() symmetrical with expand_upwards()  (Michal Hocko)
Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards. check_stack_guard_page is a nice example of the asymmetry: it uses expand_stack for VM_GROWSDOWN while expand_upwards is called for the VM_GROWSUP case.

Let's clean this up by exporting both functions and making those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault, which uses expand_upwards for register backing store expansion).

expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for the growsup configuration.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
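[After the cleanup, the guard-page check can treat both growth directions symmetrically; a rough sketch, with surrounding condition details elided.]

        /* check_stack_guard_page(), roughly: */
        if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                return expand_downwards(vma, address - PAGE_SIZE);
        if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end)
                return expand_upwards(vma, address + PAGE_SIZE);
        return 0;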
2011-05-25  fs/9p: Don't clunk dentry fid when we fail to get a writeback inode  (Aneesh Kumar K.V)
The dentry fid gets clunked via the dput.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25  9p: remove experimental tag from tested configurations  (Eric Van Hensbergen)
The 9p client is currently undergoing regular regression and stress testing as a by-product of the virtfs work. I think it's finally time to take the experimental tags off of the well-tested code paths.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25  ext4: do not normalize block requests from fallocate()  (Vivek Haldar)
Currently, an fallocate request of size slightly larger than a power of 2 is turned into two block requests, each a power of 2, with the extra blocks pre-allocated for future use. When an application calls fallocate, it already has an idea about how large the file may grow, so there is usually little benefit to reserving extra blocks on the preallocation list. This reduces disk fragmentation.

Tested: fsstress. Also verified manually that fallocated files are contiguously laid out with this change (whereas without it they begin at power-of-2 boundaries, leaving blocks in between). CPU usage of fallocate is not appreciably higher. In a tight fallocate loop, CPU usage hovers between 5%-8% with this change, and 5%-7% without it.

Using a simulated file system aging program which fills the file system to 70%, the percentage of free extents larger than 8MB (as measured by e2freefrag) increased from 38.8% without this change to 69.4% with this change.

Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25  ext4: enable "punch hole" functionality  (Allison Henderson)
This patch adds new routines: ext4_punch_hole, ext4_ext_punch_hole and ext4_ext_check_cache.

fallocate has been modified to call ext4_punch_hole when the punch hole flag is passed. At the moment, we only support punching holes in extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole routine.

The ext4_ext_punch_hole routine first completes all outstanding writes with the associated pages, and then releases them. The non-block-aligned data is zeroed, and all blocks in between are punched out.

The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache, except that it accepts an ext4_ext_cache parameter instead of an ext4_extent parameter. This routine is used by ext4_ext_punch_hole to check whether a block in a hole has been cached. The ext4_ext_cache parameter is necessary because the members of the ext4_extent structure are not large enough to hold a 32 bit value. The existing ext4_ext_in_cache routine has become a wrapper to this new function.

[ext4 punch hole patch series 5/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
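[From userspace the feature is reached through fallocate(2); a minimal usage example, with arbitrary offsets. KEEP_SIZE is mandatory together with PUNCH_HOLE, so i_size is unchanged.]

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>

        /* Punch a 1 MiB hole at offset 4 MiB in an already-open file */
        static int punch_hole(int fd)
        {
                return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 4 << 20, 1 << 20);
        }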
2011-05-25ext4: add "punch hole" flag to ext4_map_blocks()Allison Henderson
This patch adds a new flag to ext4_map_blocks() that specifies the given range of blocks should be punched out. Extents are first converted to uninitialized extents before they are punched out. Because punching a hole may require that the extent be split, it is possible that the splitting may need more blocks than are available. To deal with this, use of reserved blocks are enabled to allow the split to proceed. The routine then returns the number of blocks successfully punched out. [ext4 punch hole patch series 4/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25  ext4: punch out extents  (Allison Henderson)
This patch modifies the truncate routines to support hole punching. Below is a brief summary of the changes:

- Added an end parameter to ext4_ext_rm_leaf. This function has been modified to accept an end parameter, which enables it to punch holes in leaves instead of just truncating them.

- Implemented the "remove head" case in the ext4_remove_blocks routine. This routine is used by ext4_ext_rm_leaf to remove the tail of an extent during a truncate. The new ext4_ext_rm_leaf routine will now also use it to remove the head of an extent in the case that the hole covers a region of blocks at the beginning of an extent.

- Added an "end" parameter to the ext4_ext_remove_space routine. This function has been modified to accept a stop parameter, which is passed through to ext4_ext_rm_leaf.

[ext4 punch hole patch series 3/5 v6]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25  ext4: add new function ext4_block_zero_page_range()  (Allison Henderson)
This patch modifies the existing ext4_block_truncate_page() function, which was used by the truncate code path to zero out block-unaligned data, by adding a new length parameter, and renames it to ext4_block_zero_page_range(). This function can now be used to zero out the head of a block, the tail of a block, or the middle of a block. The ext4_block_truncate_page() function is now a wrapper to ext4_block_zero_page_range().

[ext4 punch hole patch series 2/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
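[With the length parameter added, the old entry point reduces to a thin wrapper; roughly as below, mirroring the description rather than quoting the patch byte for byte.]

        int ext4_block_truncate_page(handle_t *handle,
                                     struct address_space *mapping, loff_t from)
        {
                struct inode *inode = mapping->host;
                unsigned blocksize = inode->i_sb->s_blocksize;
                /* zero from 'from' to the end of its containing block */
                unsigned length = blocksize - (from & (blocksize - 1));

                return ext4_block_zero_page_range(handle, mapping, from, length);
        }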
2011-05-25  ext4: add flag to ext4_has_free_blocks  (Allison Henderson)
This patch adds an allocation request flag to the ext4_has_free_blocks function which enables the use of reserved blocks. This will allow a punch hole to proceed even if the disk is full. Punching a hole may require additional blocks to first split the extents.

Because ext4_has_free_blocks is a low level function, the flag needs to be passed down through several functions, listed below:

  ext4_ext_insert_extent
  ext4_ext_create_new_leaf
  ext4_ext_grow_indepth
  ext4_ext_split
  ext4_ext_new_meta_block
  ext4_mb_new_blocks
  ext4_claim_free_blocks
  ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
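[A sketch of what the threaded-through flag buys at the bottom of the chain; the logic is simplified and the flag name and function body are assumptions, not the literal patch.]

        static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
                                        s64 nblocks, unsigned int flags)
        {
                s64 free = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
                s64 root = ext4_r_blocks_count(sbi->s_es);

                if (free - root >= nblocks)
                        return 1;
                /* hole punching may dip into the root reservation */
                if ((flags & EXT4_MB_USE_ROOT_BLOCKS) && free >= nblocks)
                        return 1;
                return 0;
        }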
2011-05-25  GFS2: Processes waiting on inode glock that no processes are holding  (Bob Peterson)
This patch fixes a race in the GFS2 glock state machine that may result in lockups. The symptom is that all nodes but one will hang, waiting for a particular glock. All the holder records will have the "W" (Waiting) bit set. The other node will typically have the glock stuck in Exclusive mode (EX) with no holder records, but the dinode will be cached. In other words, an entry with "I:" will appear in the glock dump for that glock, but nothing else.

The race has to do with the glock "Pending Demote" bit, which can be set, then immediately reset, thus losing the fact that another node needs the glock. The sequence of events is:

1. Something schedules the glock workqueue (e.g. glock request from fs)
2. The glock workqueue gets to the point between the test of the reply pending bit and the spin lock:

        if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
                finish_xmote(gl, gl->gl_reply);
                drop_ref = 1;
        }
        down_read(&gfs2_umount_flush_sem);      <---- i.e. here
        spin_lock(&gl->gl_spin);

3. In comes (a) the reply to our EX lock request setting GLF_REPLY_PENDING and (b) the demote request which sets GLF_PENDING_DEMOTE
4. The following test is executed:

        if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
            gl->gl_state != LM_ST_UNLOCKED &&
            gl->gl_demote_state != LM_ST_EXCLUSIVE) {

This resets the pending demote flag, and gl->gl_demote_state is not equal to exclusive; however, because the reply from the dlm arrived after we checked for the GLF_REPLY_PENDING flag, gl->gl_state is still equal to unlocked, so although we reset the GLF_PENDING_DEMOTE flag, we didn't then set the GLF_DEMOTE flag or reinstate the GLF_PENDING_DEMOTE flag.

The patch closes the timing window by only transitioning the "Pending demote" bit to the "demote" flag once we know the other conditions (not unlocked and not exclusive) are met.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
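[The closed window, as a sketch of the fixed transition; the condition is copied from the description above, with surrounding locking omitted.]

        /* Move "pending demote" to "demote" only once we know the glock
         * is neither unlocked nor already at the demote target state. */
        if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
            gl->gl_state != LM_ST_UNLOCKED &&
            gl->gl_demote_state != LM_ST_EXCLUSIVE) {
                set_bit(GLF_DEMOTE, &gl->gl_flags);
                clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
        }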
2011-05-25  Ocfs2/move_extents: Set several trivial constraints for threshold.  (Tristan Ye)
The threshold should be greater than clustersize and less than i_size.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: Let defrag handle partial extent moving.  (Tristan Ye)
We're going to support partial extent moving, which may split an entire extent movement into pieces to cope with insufficient allocations. It eases the 'ENOSPC' pain and makes the whole move much less likely to fail; the downside is that it may leave the fs even more fragmented before moving. We just let userspace make the trade-off here.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: move/defrag extents within a certain range.  (Tristan Ye)
The basic logic of moving extents for a file is pretty much like the hole-punching sequence: walk the extents within the range the user specified, calculate an appropriate length to defrag/move, then let ocfs2_defrag/move_extent() do the actual moving. This function ends up reporting 'OCFS2_MOVE_EXT_FL_COMPLETE' to userspace if the operation completes successfully.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: helper to calculate the defragging length in one run.  (Tristan Ye)
The helper calculates the defrag length in one run according to a threshold; it proceeds with defragmentation until the threshold is met, and skips a LARGE extent if any.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: move entire/partial extent.  (Tristan Ye)
The ocfs2_move_extent() logic validates the goal_offset_in_block where extents are to be moved. What's more, it also compromises a bit by probing an appropriate region around the given goal_offset when the original goal cannot fit the movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.  (Tristan Ye)
These helpers were actually borrowed from alloc.c, which may be publicized later.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.  (Tristan Ye)
Before doing the movement of extents, we'd better probe the alloc group from 'goal_blk', searching for a contiguous region that fits the wanted movement; we even make a best-effort try by compromising to a threshold around the given goal.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: helper to validate and adjust moving goal.  (Tristan Ye)
A first best-effort attempt to validate and adjust the goal (a physical address in blocks); it can't guarantee that a later operation will succeed all the time, since the global_bitmap may change a bit over time.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.  (Tristan Ye)
This function tries to locate the right alloc group where a given physical block resides; given the block number, it returns to the caller a buffer_head of the victim group descriptor, and also the offset of the block within this group.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: defrag a range of extent.  (Tristan Ye)
It's a relatively complete function to accomplish defragmentation for an entire or partial extent; one journal handle is kept during the operation. It logically does one more thing than ocfs2_move_extent() actually does: yes, it claims the new clusters itself ;-)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: move a range of extent.  (Tristan Ye)
The moving range of __ocfs2_move_extent() is always within one extent. It consists of the following parts:

1. Duplicate the clusters in pages to new_blkoffset, where the extent is to be moved.
2. Split the original extent with the new extent, and coalesce the nearby extents if possible.
3. Append the old clusters to the truncate log, or decrease_refcount if the extent was refcounted.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.  (Tristan Ye)
ocfs2_lock_allocators_move_extents() is like the common ocfs2_lock_allocators(): it locks the metadata and data allocators during extents moving, reserves the appropriate metadata blocks and data clusters, and also performs a best-effort calculation of the credits for the journal transaction in one run of movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: Add basic framework and source files for extent moving.  (Tristan Ye)
Add new files move_extents.[c|h] and fill them with nothing but a context structure.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.  (Tristan Ye)
The patch also adds a structure for manipulating this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2/refcounttree: Publicize a couple of funcs from refcounttree.c  (Tristan Ye)
The original goal of commonizing these funcs is to benefit the defragging/extent-moving code in the future, based on the fact that reflink and defragmentation share the same Copy-On-Write mechanism.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.  (Tristan Ye)
This new code is a bit more complicated than the former ones. The goal is to show the user all statistics required to take a deep insight into the filesystem and how the disk is being fragmented. The goal is achieved by scanning the global bitmap from (cluster)group to group to figure out the following factors in the filesystem:

- How many free chunks of a fixed size, as the user requested.
- How many real free chunks of all sizes.
- Min/Max/Avg size (in clusters) of free chunks.
- How free chunks are distributed (in size) in terms of a histogram, just like the following:

  ---------------------------------------------------------
  Extent Size Range :  Free extents  Free Clusters  Percent
     32K...   64K-  :             1              1    0.00%
      1M...    2M-  :             9            288    0.03%
      8M...   16M-  :             2            831    0.09%
     32M...   64M-  :             1           2047    0.23%
    128M...  256M-  :             1           8191    0.92%
    256M...  512M-  :             2          21706    2.43%
    512M... 1024M-  :            27         858623   96.29%
  ---------------------------------------------------------

A userspace ioctl() call eventually gets the above info returned by passing a 'struct ocfs2_info_freefrag' with the chunk_size specified first.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
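[The histogram amounts to log2 bucketing of each free chunk as the bitmap is scanned; a sketch with assumed struct field and constant names (check ocfs2_fs.h for the real struct ocfs2_info_freefrag layout).]

        /* Account one free chunk of 'chunksize' clusters (names assumed) */
        static void ocfs2_account_free_chunk(struct ocfs2_info_freefrag *ffg,
                                             unsigned int chunksize)
        {
                int bucket = fls(chunksize) - 1;        /* log2 size class */

                if (bucket >= OCFS2_FREEFRAG_HIST)      /* assumed bound */
                        bucket = OCFS2_FREEFRAG_HIST - 1;
                ffg->iff_hist[bucket].fc_chunks++;
                ffg->iff_hist[bucket].fc_clusters += chunksize;
        }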
2011-05-25  Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.  (Tristan Ye)
The new code is dedicated to calculating the free inode count of all inode allocators, then returning the info to userspace in terms of an array.

In particular, the flag 'OCFS2_INFO_FL_NON_COHERENT', controlled by '--cluster-coherent' from userspace, is now involved. Setting the flag means no cluster coherency is considered; usually, userspace tools choose the non-coherent strategy by default for the sake of performance.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25  Ocfs2: Using inline funcs to set/clear *FILLED* flags in info handler.  (Tristan Ye)
It just removes some macros for the sake of typechecking gains.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>