Age | Commit message (Collapse) | Author |
|
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30547a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30546a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30545a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Now that the ktrace_enter() code is using atomics, the non-power-of-2
buffer sizes - which require modulus operations to get the index - are
showing up as using substantial CPU in the profiles.
Force the buffer sizes to be rounded up to the nearest power of two and
use masking rather than modulus operations to convert the index counter to
the buffer index. This reduces ktrace_enter overhead to 8% of a CPU time,
and again almost halves the trace intensive test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30538a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
ktrace_enter() is consuming vast amounts of CPU time due to the use of a
single global lock for protecting buffer index increments. Change it to
use per-buffer atomic counters - this reduces ktrace_enter() overhead
during a trace intensive test on a 4p machine from 58% of all CPU time to
12% and halves test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30537a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
XFS changes the c/mtime of an inode when truncating it to the same size.
The c/mtime is only supposed to change if the size is changed. Not to be
confused with ftruncate, where the c/mtime is supposed to be changed even
if the size is not changed.
The Linux VFS encodes this semantic difference in the flags it sends down
to ->setattr, which XFS currently ignores. We need to make XFS pay
attention to the VFS flags and hence Do The Right Thing.
SGI-PV: 977547
SGI-Modid: xfs-linux-melb:xfs-kern:30536a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
As Dave pointed out after the export ops changes we now always encode the
parent into the filehandle for regular files, but it's not actually needed
when the filesystem is export with no_subtree_check. This one-liner fixes
xfs_fs_encode_fh to skip encoding the parent unless nessecary.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30535a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
We can just use xfs_ilock/xfs_iunlock instead and get rid of the ugly
bhv_vrwlock_t.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30533a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Instead of of xfs_get_dir_entry use a macro to get the xfs_inode from the
dentry in the callers and grab the reference manually.
Only grab the reference once as it's fine to keep it over the dmapi calls.
(And even that reference is actually superflous in Linux but I'll leave
that for another patch)
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30531a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Cleanup the unneeded intermediate vnode step in the flushing helpers and
go directly from the xfs_inode to the struct address_space.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30530a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
- use proper goto based unwinding instead of the current mess of
multiple conditionals
- rename ip to inode because that's the normal convention for Linux
inodes while ip is the convention for xfs_inodes
- remove unlikely checks for the default_acl - branches marked unlikely
might lead to extreme branch bredictor slowdons if taken and for some
workloads a default acl is quite common
- properly indent the switch statements
- remove xfs_has_fs_struct as nfsd has a fs_struct in any semi-recent
kernel
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30529a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Now that we update the log tail LSN less frequently on transaction
completion, we pass the contention straight to the global log state lock
(l_iclog_lock) during transaction completion.
We currently have to take this lock to decrement the iclog reference
count. there is a reference count on each iclog, so we need to take þhe
global lock for all refcount changes.
When large numbers of processes are all doing small trnasctions, the iclog
reference counts will be quite high, and the state change that absolutely
requires the l_iclog_lock is the except rather than the norm.
Change the reference counting on the iclogs to use atomic_inc/dec so that
we can use atomic_dec_and_lock during transaction completion and avoid the
need for grabbing the l_iclog_lock for every reference count decrement
except the one that matters - the last.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30505a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
When hundreds of processors attempt to commit transactions at the same
time, they can contend on the AIL lock when updating the tail LSN held in
the in-core log structure.
At the moment, the tail LSN is only needed when actually writing out an
iclog, so it really does not need to be updated on every single
transaction completion - only those that result in switching iclogs and
flushing them to disk.
The result is that we reduce the number of times we need to grab the AIL
lock and the log grant lock by up to two orders of magnitude on large
processor count machines. The problem has previously been hidden by AIL
lock contention walking the AIL list which was recently solved and
uncovered this issue.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30504a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Remove open coded checks for the whether the inode is clean and replace
them with an inlined function.
SGI-PV: 977461
SGI-Modid: xfs-linux-melb:xfs-kern:30503a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Remove the xfs_icluster structure and replace with a radix tree lookup.
We don't need to keep a list of inodes in each cluster around anymore as
we can look them up quickly when we need to. The only time we need to do
this now is during inode writeback.
Factor the inode cluster writeback code out of xfs_iflush and convert it
to use radix_tree_gang_lookup() instead of walking a list of inodes built
when we first read in the inodes.
This remove 3 pointers from each xfs_inode structure and the xfs_icluster
structure per inode cluster. Hence we reduce the cache footprint of the
xfs_inodes by between 5-10% depending on cluster sparseness.
To be truly efficient we need a radix_tree_gang_lookup_range() call to
stop searching once we are past the end of the cluster instead of trying
to find a full cluster's worth of inodes.
Before (ia64):
$ cat /sys/slab/xfs_inode/object_size 536
After:
$ cat /sys/slab/xfs_inode/object_size 512
SGI-PV: 977460
SGI-Modid: xfs-linux-melb:xfs-kern:30502a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
When pdflush is writing back inodes, it can get stuck on inode cluster
buffers that are currently under I/O. This occurs when we write data to
multiple inodes in the same inode cluster at the same time.
Effectively, delayed allocation marks the inode dirty during the data
writeback. Hence if the inode cluster was flushed during the writeback of
the first inode, the writeback of the second inode will block waiting for
the inode cluster write to complete before writing it again for the newly
dirtied inode.
Basically, we want to avoid this from happening so we don't block pdflush
and slow down all of writeback. Hence we introduce a non-blocking async
inode flush flag that pdflush uses. If this flag is set, we use
non-blocking operations (e.g. try locks) whereever we can to avoid
blocking or extra I/O being issued.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30501a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
The only difference between the functions is one passes an inode for the
lookup, the other passes an inode number. However, they don't do the same
validity checking or set all the same state on the buffer that is returned
yet they should.
Factor the functions into a common implementation.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30500a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30490a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
|
|
Remove the xfs_refcache, it was only needed while we were still
building for 2.4 kernels.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30472a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
On a forced shutdown, xfs_finish_reclaim() will skip flushing the inode.
If the inode flush lock is not already held and there is an outstanding
xfs_iflush_done() then we might free the inode prematurely. By acquiring
and releasing the flush lock we will synchronise with xfs_iflush_done().
SGI-PV: 909874
SGI-Modid: xfs-linux-melb:xfs-kern:30468a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
|
|
bailout if fails.
SGI-PV: 973041
SGI-Modid: xfs-linux-melb:xfs-kern:30462a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
|
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
|
|
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
jbd2 debugfs and stats entries should only be created if cache initialisation
is successful. At the moment they are being created unconditionally which
will leave them dangling if cache (and hence module) initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fix kernel-doc notation in jbd2.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If an error occurs during jbd2 cache initialisation it is possible for the
journal_head_cache to be NULL when jbd2_journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.
Note that even with this fix things will break badly if jbd2 is statically
compiled in and cache initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.
There should not be any functional change.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Make revocation cache destruction safe to call if initialisation fails
partially or entirely. This allows it to be used to cleanup in the case of
initialisation failure, simplifying the code slightly.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
This patch makes the needlessly global ext4_xattr_list() static.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The function prototype of ext4_new_blocks_old() is defined in ext4_fs.h,
so we don't need the extra function prototype in mballoc.c
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Check ext4_journal_get_write_access() errors.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>
|
|
Use ext4_get_group_desc() in ext4_get_inode_block() instead of open
coding the functionality.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>
|
|
Use ext4_group_first_block_no() and assign the return values to
ext4_fsblk_t variables.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>
|
|
Convert byte order of constant instead of variable which can be done at
compile time (vs run time).
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
expression_in_cpu_byteorder);
with:
leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: sct@redhat.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: adilger@clusterfs.com
Cc: Mingming Cao <cmm@us.ibm.com>
|
|
The list_for_each_entry_rcu() primitive should be used instead of
list_for_each_rcu(), as the former is easier to use and provides
better type safety.
http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
mballoc.c is a whole lot of static functions, which gcc seems to
really like to inline.
With the changes below, on x86, I can at least get from:
432 ext4_mb_new_blocks
240 ext4_mb_free_blocks
208 ext4_mb_discard_group_preallocations
188 ext4_mb_seq_groups_show
164 ext4_mb_init_cache
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
...
to
220 ext4_mb_free_blocks
188 ext4_mb_seq_groups_show
176 ext4_mb_regular_allocator
164 ext4_mb_init_cache
156 ext4_mb_new_blocks
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
124 ext4_mb_release_group_pa
...
which still has some big functions in there, but not 432 bytes!
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
I checked ext4_ioctl and it looked largely safe to not be used
without BKL. So convert it over to unlocked_ioctl.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When we convert an uninitialized extent to an initialized extent
we need to make sure we return the number of blocks in the
extent from the file system block corresponding to logical
file block. Otherwise we cache wrong extent details and this
results in file system corruption.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
ext4_ext_get_blocks() returns the number of blocks allocated with buffer
head unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space
since the blocks are already reserved on disk. Mark the buffer head
unwritten. Some code paths try to read the block if the buffer_head is
not new and no uptodate. Marking the buffer head unwritten avoids this
reading.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
ext4_ext_get_blocks() returns number of blocks allocated with buffer
heads unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space since
the blocks are already resevred on disk. Fix ext4_ext_get_blocks to not
return greater than max_blocks, since some of the code paths cannot
handle such a return value.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
ext4_fallocate needs to update file size in each transaction. Otherwise
if we crash the file size won't be seen. We were also not marking
the inode dirty after updating file size before. Also when we try to
retry allocation due to ENOSPC, make sure we reset the variable ret so
that we actually do a retry.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fail migrate if we allocated new blocks via mmap write.
If we write to holes in the file via mmap, we end up allocating
new blocks. This block allocation happens without taking inode->i_mutex.
Since migrate is protected by i_mutex and migrate expects that no
new blocks get allocated during migrate, fail migrate if new blocks
get allocated.
We can't take inode->i_mutex in the mmap write path because that
would result in a locking order violation between i_mutex and mmap_sem.
Also adding a separate rw_sempahore for protection is really high overhead
for a rare operation such as migrate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If the preallocated area is small zero out the full extent
instead of splitting them. This should avoid the "write
every alternate block" problem that could grow the number
of extents dramatically.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This patch handles possible ENOSPC errors when writing to an
uninitialized extent in case the filesystem is full.
A write to a prealloc area causes the split of an unititalized extent
into initialized and uninitialized extents. If we don't have
space to add new extent information, instead of returning error,
convert the existing uninitialized extent to initialized one. We
need to zero out the blocks corresponding to the entire extent to
prevent uninitialized data reaching userspace.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This patch enables extent-formatted normal symlinks. Using extents
format allows a symlink to refer to a block number larger than 2^32
on large filesystems. We still don't enable extent format for fast
symlinks, which are contained in the inode itself.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Put the old extent details back if we fail to split the
uninitialized extent.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|