Age | Commit message (Collapse) | Author |
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We need to ensure that we clear NFS4_SLOT_TBL_DRAINING on the back
channel when we're done recovering the session.
Regression introduced by commit 774d5f14e (NFSv4.1 Fix a pNFS session
draining deadlock)
Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: Changed order to start back-channel first. Minor code cleanup]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.10]
|
|
This patch exploits pstore subsystem to read details of common partition
in NVRAM to a separate file in /dev/pstore. For instance, common partition
details will be stored in a file named [common-nvram-6].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch set exploits the pstore subsystem to read details of
of-config partition in NVRAM to a separate file in /dev/pstore.
For instance, of-config partition details will be stored in a
file named [of-nvram-5].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch set exploits the pstore subsystem to read details of rtas partition
in NVRAM to a separate file in /dev/pstore. For instance, rtas details will be
stored in a file named [rtas-nvram-4].
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch correctly distinguishes two boundary conditions:
1. When the given range is entire within the unaccounted space between
two rgrps, and
2. The range begins beyond the end of the filesystem
Also fix the unit of the returned value r.len (total trimming) to be in bytes
instead of the (incorrect) 512 byte blocks
With this patch, GFS2 passes multiple iterations of all the relevant xfstests
(251, 260, 288) with different fs block sizes.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch fixes a warning message introduced in the recent
"GFS2: aggressively issue revokes in gfs2_log_flush" patch.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
we always be emitting a warning and returning an error.
Hence, we can remove it and get rid of a couple of redundant
check up against it at xfs_upate_alignment().
Thanks Dave Chinner for the suggestions of simplify the code
in xfs_parseargs().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Upstream commit 5b292ae3a951a58e32119d73c7ac8f5bec7395a3
xfs: make use of xfs_calc_buf_res() in xfs_trans.c
Beginning from above commit, neither XFS_ALLOCFREE_LOG_RES() nor
XFS_DIROP_LOG_RES() is used by those routines for calculating
transaction space reservations, so it's safe to remove them now.
Also, with a slightly update for the relevant comments to reflect
the ideas of why those log count numbers should be.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
For FIEMAP ioctl(2), if an extent is in delayed allocation
state, we need to return the FIEMAP_EXTENT_UNKNOWN flag except
the FIEMAP_EXTENT_DELALLOC because its data location is unknown.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Adding an extended attribute to a symbolic link can force that
link to an remote extent. xfs_inactive() incorrectly assumes
that any symbolic link small enough to be in the inode core
is incore, resulting in the remote extent to not be removed.
xfs_ifree() will assert on presence of this leaked remote extent.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This fixes POSIX locks and possibly a few other v4.2 features, like
readdir plus.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
code would normally be in a locked section where it should return 1. This
means it cannot be used for an assertion that checks that a spinlock is locked.
Remove such usages from FS-Cache.
The following oops might otherwise be observed:
FS-Cache: Assertion failed
BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
Workqueue: fscache_operation fscache_op_work_func [fscache]
7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
Call Trace:
7f091c88: [<60299eb8>] dump_stack+0x17/0x19
7f091c98: [<602951c5>] panic+0xf4/0x1e9
7f091d38: [<6002b10e>] set_signals+0x1e/0x40
7f091d58: [<6005b89e>] __wake_up+0x4e/0x70
7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache]
7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache]
7f091db8: [<60082985>] unlock_page+0x55/0x60
7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles]
7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache]
7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0
7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0
7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0
7f091f58: [<60054418>] kthread+0xd8/0xe0
7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0
7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0
Reported-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
|
|
struct fscache_retrieval contains a count of the number of pages that still
need some processing (n_pages). This is decremented as the pages are
processed.
However, this needs to be atomic as fscache_retrieval_complete() (I think) just
occasionally may be called from cachefiles_read_backing_file() and
cachefiles_read_copier() simultaneously.
This happens when an fscache_read_or_alloc_pages() request containing a lot of
pages (say a couple of hundred) is being processed. The read on each backing
page is dispatched individually because we need to insert a monitor into the
waitqueue to catch when the read completes. However, under low-memory
conditions, we might be forced to wait in the allocator - and this gives the
I/O on the backing page a chance to complete first.
When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
the workqueue without waiting for the operation to finish the initial I/O
dispatch (we want to release any pages we can as soon as we can), thus both can
end up running simultaneously and potentially attempting to partially complete
the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
the page cache).
This was demonstrated by parallelling the non-atomic counter with an atomic
counter and printing both of them when the assertion fails. At this point, the
atomic counter has reached zero, but the non-atomic counter has not.
To fix this, make the counter an atomic_t.
This results in the following bug appearing
FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:421!
or
FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:414!
With a backtrace like the following:
RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache]
Call Trace:
[<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache]
[<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache]
[<ffffffff81090b10>] worker_thread+0x170/0x2a0
[<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810909a0>] ? worker_thread+0x0/0x2a0
[<ffffffff81096966>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968d0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Simplify the way fscache cache objects retain their cookie. The way I
implemented the cookie storage handling made synchronisation a pain (ie. the
object state machine can't rely on the cookie actually still being there).
Instead of the the object being detached from the cookie and the cookie being
freed in __fscache_relinquish_cookie(), we defer both operations:
(*) The detachment of the object from the list in the cookie now takes place
in fscache_drop_object() and is thus governed by the object state machine
(fscache_detach_from_cookie() has been removed).
(*) The release of the cookie is now in fscache_object_destroy() - which is
called by the cache backend just before it frees the object.
This means that the fscache_cookie struct is now available to the cache all the
way through from ->alloc_object() to ->drop_object() and ->put_object() -
meaning that it's no longer necessary to take object->lock to guarantee access.
However, __fscache_relinquish_cookie() doesn't wait for the object to go all
the way through to destruction before letting the netfs proceed. That would
massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the
cookie around, in must therefore break all attachments to the netfs - which
includes ->def, ->netfs_data and any outstanding page read/writes.
To handle this, struct fscache_cookie now has an n_active counter:
(1) This starts off initialised to 1.
(2) Any time the cache needs to get at the netfs data, it calls
fscache_use_cookie() to increment it - if it is not zero. If it was zero,
then access is not permitted.
(3) When the cache has finished with the data, it calls fscache_unuse_cookie()
to decrement it. This does a wake-up on it if it reaches 0.
(4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
reach 0. The initialisation to 1 in step (1) ensures that we only get
wake ups when we're trying to get rid of the cookie.
This leaves __fscache_relinquish_cookie() a lot simpler.
***
This fixes a problem in the current code whereby if fscache_invalidate() is
followed sufficiently quickly by fscache_relinquish_cookie() then it is
possible for __fscache_relinquish_cookie() to have detached the cookie from the
object and cleared the pointer before a thread is dispatched to process the
invalidation state in the object state machine.
Since the pending write clearance was deferred to the invalidation state to
make it asynchronous, we need to either wait in relinquishment for the stores
tree to be cleared in the invalidation state or we need to handle the clearance
in relinquishment.
Further, if the relinquishment code does clear the tree, then the invalidation
state need to make the clearance contingent on still having the cookie to hand
(since that's where the tree is rooted) and we have to prevent the cookie from
disappearing for the duration.
This can lead to an oops like the following:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
...
RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30
...
CR2: 000000000000000c ...
...
Process kslowd002 (...)
....
Call Trace:
[<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache]
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150
[<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180
[<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
[<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0
[<ffffffff8110e233>] slow_work_execute+0x233/0x310
[<ffffffff8110e515>] slow_work_thread+0x205/0x360
[<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8110e310>] ? slow_work_thread+0x0/0x360
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
The parameter to fscache_invalidate_writes() was object->cookie which is NULL.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Fix object state machine to have separate work and wait states as that makes
it easier to envision.
There are now three kinds of state:
(1) Work state. This is an execution state. No event processing is performed
by a work state. The function attached to a work state returns a pointer
indicating the next state to which the OSM should transition. Returning
NO_TRANSIT repeats the current state, but goes back to the scheduler
first.
(2) Wait state. This is an event processing state. No execution is
performed by a wait state. Wait states are just tables of "if event X
occurs, clear it and transition to state Y". The dispatcher returns to
the scheduler if none of the events in which the wait state has an
interest are currently pending.
(3) Out-of-band state. This is a special work state. Transitions to normal
states can be overridden when an unexpected event occurs (eg. I/O error).
Instead the dispatcher disables and clears the OOB event and transits to
the specified work state. This then acts as an ordinary work state,
though object->state points to the overridden destination. Returning
NO_TRANSIT resumes the overridden transition.
In addition, the states have names in their definitions, so there's no need for
tables of state names. Further, the EV_REQUEUE event is no longer necessary as
that is automatic for work states.
Since the states are now separate structs rather than values in an enum, it's
not possible to use comparisons other than (non-)equality between them, so use
some object->flags to indicate what phase an object is in.
The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
(EV_KILL). An object flag now carries the information about retirement.
Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
into an KILL_OBJECT state and additional states have been added for handling
waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).
A state has also been added for synchronising with parent object initialisation
(WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Wrap checks on object state (mostly outside of fs/fscache/object.c) with
inline functions so that the mechanism can be replaced.
Some of the state checks within object.c are left as-is as they will be
replaced.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Uninline fscache_object_init() so as not to expose some of the FS-Cache
internals to the cache backend.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set. This
goes some way towards mitigating fscache deadlocking against ext4 by way of
the allocator, eg:
INFO: task flush-8:0:24427 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0 D ffff88003e2b9fd8 0 24427 2 0x00000000
ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
Call Trace:
[<ffffffff8106dcf5>] ? __lock_is_held+0x31/0x53
[<ffffffff81219b61>] ? radix_tree_lookup_element+0xf4/0x12a
[<ffffffff81454bed>] schedule+0x60/0x62
[<ffffffffa01d349c>] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
[<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
[<ffffffffa01d393a>] __fscache_maybe_release_page+0x30c/0x324 [fscache]
[<ffffffffa01d369a>] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
[<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
[<ffffffffa01fd7b2>] nfs_fscache_release_page+0x68/0x94 [nfs]
[<ffffffffa01ef73e>] nfs_release_page+0x7e/0x86 [nfs]
[<ffffffff810aa553>] try_to_release_page+0x32/0x3b
[<ffffffff810b6c70>] shrink_page_list+0x535/0x71a
[<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
[<ffffffff810b7352>] shrink_inactive_list+0x20a/0x2dd
[<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
[<ffffffff810b7a65>] shrink_lruvec+0x34c/0x3eb
[<ffffffff810b7bd3>] do_try_to_free_pages+0xcf/0x355
[<ffffffff810b7fc8>] try_to_free_pages+0x9a/0xa1
[<ffffffff810b08d2>] __alloc_pages_nodemask+0x494/0x6f7
[<ffffffff810d9a07>] kmem_getpages+0x58/0x155
[<ffffffff810dc002>] fallback_alloc+0x120/0x1f3
[<ffffffff8106db23>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff810dbed3>] ____cache_alloc_node+0x177/0x186
[<ffffffff81162a6c>] ? ext4_init_io_end+0x1c/0x37
[<ffffffff810dc403>] kmem_cache_alloc+0xf1/0x176
[<ffffffff810b17ac>] ? test_set_page_writeback+0x101/0x113
[<ffffffff81162a6c>] ext4_init_io_end+0x1c/0x37
[<ffffffff81162ce4>] ext4_bio_write_page+0x20f/0x3af
[<ffffffff8115cc02>] mpage_da_submit_io+0x26e/0x2f6
[<ffffffff811088e5>] ? __find_get_block_slow+0x38/0x133
[<ffffffff81161348>] mpage_da_map_and_submit+0x3a7/0x3bd
[<ffffffff81161a60>] ext4_da_writepages+0x30d/0x426
[<ffffffff810b3359>] do_writepages+0x1c/0x2a
[<ffffffff81102f4d>] __writeback_single_inode+0x3e/0xe5
[<ffffffff81103995>] writeback_sb_inodes+0x1bd/0x2f4
[<ffffffff81103b3b>] __writeback_inodes_wb+0x6f/0xb4
[<ffffffff81103c81>] wb_writeback+0x101/0x195
[<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
[<ffffffff811043aa>] ? wb_do_writeback+0xaa/0x173
[<ffffffff8110434a>] wb_do_writeback+0x4a/0x173
[<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff81038554>] ? del_timer+0x4b/0x5b
[<ffffffff811044e0>] bdi_writeback_thread+0x6d/0x147
[<ffffffff81104473>] ? wb_do_writeback+0x173/0x173
[<ffffffff81048fbc>] kthread+0xd0/0xd8
[<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
[<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
[<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
[<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
2 locks held by flush-8:0/24427:
#0: (&type->s_umount_key#41){.+.+..}, at: [<ffffffff810e3b73>] grab_super_passive+0x4c/0x76
#1: (jbd2_handle){+.+...}, at: [<ffffffff81190d81>] start_this_handle+0x475/0x4ea
The problem here is that another thread, which is attempting to write the
to-be-stored NFS page to the on-ext4 cache file is waiting for the journal
lock, eg:
INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:2 D ffff880039589768 0 24437 2 0x00000000
ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
Call Trace:
[<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
[<ffffffff81455e73>] ? _raw_spin_unlock_irqrestore+0x3a/0x50
[<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
[<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff81454bed>] schedule+0x60/0x62
[<ffffffff81190c23>] start_this_handle+0x317/0x4ea
[<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
[<ffffffff81190fcc>] jbd2__journal_start+0xb3/0x12e
[<ffffffff81176606>] __ext4_journal_start_sb+0xb2/0xc6
[<ffffffff8115f137>] ext4_da_write_begin+0x109/0x233
[<ffffffff810a964d>] generic_file_buffered_write+0x11a/0x264
[<ffffffff811032cf>] ? __mark_inode_dirty+0x2d/0x1ee
[<ffffffff810ab1ab>] __generic_file_aio_write+0x2a5/0x2d5
[<ffffffff810ab24a>] generic_file_aio_write+0x6f/0xd0
[<ffffffff81159a2c>] ext4_file_write+0x38c/0x3c4
[<ffffffff810e0915>] do_sync_write+0x91/0xd1
[<ffffffffa00a17f0>] cachefiles_write_page+0x26f/0x310 [cachefiles]
[<ffffffffa01d470b>] fscache_write_op+0x21e/0x37a [fscache]
[<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
[<ffffffffa01d2479>] fscache_op_work_func+0x78/0xd7 [fscache]
[<ffffffff8104455a>] process_one_work+0x232/0x3a8
[<ffffffff810444ff>] ? process_one_work+0x1d7/0x3a8
[<ffffffff81044ee0>] worker_thread+0x214/0x303
[<ffffffff81044ccc>] ? manage_workers+0x245/0x245
[<ffffffff81048fbc>] kthread+0xd0/0xd8
[<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
[<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
[<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
[<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
4 locks held by kworker/u:2/24437:
#0: (fscache_operation){.+.+.+}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
#1: ((&op->work)){+.+.+.}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
#2: (sb_writers#14){.+.+.+}, at: [<ffffffff810ab22c>] generic_file_aio_write+0x51/0xd0
#3: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [<ffffffff810ab236>] generic_file_aio_write+0x5b/0x
fscache already tries to cancel pending stores, but it can't cancel a write
for which I/O is already in progress.
An alternative would be to accept writing garbage to the cache under extreme
circumstances and to kill the afflicted cache object if we have to do this.
However, we really need to know how strapped the allocator is before deciding
to do that.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
Just some cleanup.
(And note the caller of this function may, for example, call vfs_unlink
on a child, so the "1" (I_MUTEX_PARENT) really was what was intended
here.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
The spinlock() within the condition in while() will cause a compile error
if it is not a function. This is not a problem on mainline but it does not
look pretty and there is no reason to do it that way.
That patch writes it a little differently and avoids the double condition.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
|
|
This patch looks at all the outstanding blocks in all the transactions
on the log, and moves the completed ones to the ail2 list. Then it
issues revokes for these blocks. This will hopefully speed things up
in situations where there is a lot of contention for glocks, especially
if they are acquired serially.
revoke_lo_before_commit will issue at most one log block's full of these
preemptive revokes. The amount of reserved log space that
gfs2_log_reserve() ignores has been incremented to allow for this extra
block.
This patch also consolidates the common revoke instructions into one
function, gfs2_add_revoke().
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Give them names that are a bit more consistent with the general
pNFS naming scheme.
- lo_seg_contained -> pnfs_lseg_range_contained
- lo_seg_intersecting -> pnfs_lseg_range_intersecting
- cmp_layout -> pnfs_lseg_range_cmp
- is_matching_lseg -> pnfs_lseg_range_match
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Also strip off the unnecessary 'inline' declarations.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
The other protocols don't use it, so make it local to NFSv4, and
remove the EXPORT.
Also ensure that we only compile in cache_lib.o if we're using
the legacy DNS resolver.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
|
|
Make sure that NFSv4 SETCLIENTID does not parse the NETID as a
format string.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
We want these fixes here too.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Changing size of a file on server and local update (fuse_write_update_size)
should be always protected by inode->i_mutex. Otherwise a race like this is
possible:
1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE).
fuse_file_fallocate() sends FUSE_FALLOCATE request to the server.
2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr()
sends shrinking FUSE_SETATTR request to the server and updates local i_size
by i_size_write(inode, outarg.attr.size).
3. Process 'A' resumes execution of fuse_file_fallocate() and calls
fuse_write_update_size(inode, offset + length). But 'offset + length' was
obsoleted by ftruncate from previous step.
Changed in v2 (thanks Brian and Anand for suggestions):
- made relation between mutex_lock() and fuse_set_nowrite(inode) more
explicit and clear.
- updated patch description to use ftruncate(2) in example
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
Remove struct xfs_chash from struct xfs_mount as there is no user of
it nowadays.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
As per the mount man page, sunit and swidth can be changed via
mount options. For XFS, on the face of it, those options seems
works if the specified alignments is properly, e.g.
# mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
# mount | grep sdb1
/dev/sdb1 on /mnt type xfs (rw,sunit=4096,swidth=8192)
However, neither sunit nor swidth is shown from the xfs_info output.
# xfs_info /mnt
meta-data=/dev/sdb1 isize=256 agcount=4, agsize=262144 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1048576, imaxpct=25
= sunit=0 swidth=0 blks
^^^^^^^^^^^^^^^^^^^^^^^^^^
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The reason is that the alignment can only be changed if the relevant
super block is already configured with alignments, otherwise, the
given value is silently ignored.
With this fix, the attempt to mount a storage without strip alignment
setup on a super block will get an error with a warning in syslog to
indicate the true cause, e.g.
# mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
.......
XFS (sdb1): cannot change alignment: superblock does not support data
alignment
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Commit eab4e633 "xfs: uncached buffer reads need to return an error".
Remove redundant error variable, using the function level error variable
to store bp->b_error instead.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This patch removed several unused variables.
Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Add the f2fs_remount function call which will be used
during the filesystem remounting. This function
will help us to change the mount options specific to
f2fs.
Also modify the f2fs background_gc mount option, which
will allow the user to dynamically trun on/off the
garbage collection in f2fs based on the background_gc
value. If background_gc=on, Garbage collection will
be turned off & if background_gc=off, Garbage collection
will be truned on.
By default the garbage collection is on in f2fs.
Change Log:
v2: Incorporated the review comments by Gu Zheng.
Removing the restore part for VFS flags
Updating comments with proper flag conditions
Display GC background option as ON/OFF
Revised conditions to stop GC in case of remount
v1: Initial changes for adding remount_fs callback
support.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change /** with /* for the coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"Several fixes + obvious cleanup (you've missed a couple of open-coded
can_lookup() back then)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
snd_pcm_link(): fix a leak...
use can_lookup() instead of direct checks of ->i_op->lookup
move exit_task_namespaces() outside of exit_notify()
fput: task_work_add() can fail if the caller has passed exit_task_work()
ncpfs: fix rmdir returns Device or resource busy
|
|
Pull xfs fixes from Ben Myers:
- Remove noisy warnings about experimental support which spams the logs
- Add padding to align directory and attr structures correctly
- Set block number on child buffer on a root btree split
- Disable verifiers during log recovery for non-CRC filesystems
* tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs:
xfs: don't shutdown log recovery on validation errors
xfs: ensure btree root split sets blkno correctly
xfs: fix implicit padding in directory and attr CRC formats
xfs: don't emit v5 superblock warnings on write
|
|
a couple of places got missed back when Linus has introduced that one...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.
Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from
if (PF_KTHREAD) {
schedule_work(...);
return;
}
task_work_add(...)
to
if (!PF_KTHREAD) {
if (!task_work_add(...))
return;
/* fallback */
}
schedule_work(...);
As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
There doesn't appear to be any reason for the overall pstore RAM buffer to
be a power of 2 size, so remove it. The individual console, ftrace and oops
buffers are still a power of 2 size.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|
For persistent RAM outside of main memory, the memory may have limitations
on supported accesses. For internal RAM on highbank platform exclusive
accesses are not supported and will hang the system. So atomic_cmpxchg
cannot be used. This commit uses spinlock protection for buffer size and
start updates on ioremapped regions instead.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.
Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.
For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 9222a9cf86c0d64ffbedf567412b55da18763aa3)
|
|
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:
XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317
which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:
$ pminfo -f xfs.btree.block_map
xfs.btree.block_map.lookup
value 39135
xfs.btree.block_map.compare
value 268432
xfs.btree.block_map.insrec
value 15786
xfs.btree.block_map.delrec
value 13884
xfs.btree.block_map.newroot
value 2
xfs.btree.block_map.killroot
value 0
.....
Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.
Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.
Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit ade1335afef556df6538eb02e8c0dc91fbd9cc37)
|
|
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:
.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.
It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:
[ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[ 172.316346] #9 calling xfs_dir2_data_log_unused()
[ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096
Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.
Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.
And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.
Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 8a1fd2950e1fe267e11fc8c85dcaa6b023b51b60)
|
|
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:
XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!
And spamming the logs.
We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 34510185abeaa5be9b178a41c0a03d30aec3db7e)
|
|
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.
Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.
For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
For TCP we disable Nagle and I cannot think of why it would be needed
for SCTP. When disabled it seems to improve dlm_lock operations like it
does for TCP.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Currently if a SCTP send fails, we lose the data we were trying
to send because the writequeue_entry is released when we do the send.
When this happens other nodes will then hang waiting for a reply.
This adds support for SCTP to retry the send operation.
I also removed the retry limit for SCTP use, because we want
to make sure we try every path during init time and for longer
failures we want to continually retry in case paths come back up
while trying other paths. We will do this until userspace tells us
to stop.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Currently, if we cannot create a association to the first IP addr
that is added to DLM, the SCTP init assoc code will just retry
the same IP. This patch adds a simple failover schemes where we
will try one of the addresses that was passed into DLM.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
|