summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2013-05-28f2fs: cover cp_file information with ilockJaegeuk Kim
If a file is linked with other files, it should be checkpointed at every fsync calls. For this, we use set_cp_file() with FADVISE_CP_BIT, but previously we didn't cover the flag by the global lock. This patch fixes that the inode page stores this correctly. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix incorrect iputs during the dentry recoveryJaegeuk Kim
- iget/iput flow in the dentry recovery process 1. *dir* = f2fs_iget 2. set FI_DELAY_IPUT to *dir* 3. add *dir* to the dirty_dir_list - __f2fs_add_link - recover_dentry) 4. iput *dir* by remove_dirty_dir_inode - sync_dirty_dir_inodes - write_chekcpoint If *dir*'s i_count is not 1 (i.e., root dir), remove_dirty_dir_inode is called later and then iput is triggered again due to the FI_DELAY_IPUT flag. So, let's unset the flag properly once iput is triggered. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix dentry recovery routineJaegeuk Kim
The error scenario is: 1. create /a (1.a link /a /b) 2. sync 3. unlinke /a 4. create /a 5. fsync /a 6. Sudden power-off When the f2fs recovers the fsynced dentry, /a, we discover an exsiting dentry at f2fs_find_entry() in recover_dentry(). In such the case, we should unlink the existing dentry and its inode and then recover newly created dentry. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: iput only if whole data blocks are flushedJaegeuk Kim
If there remains some unwritten blocks from the recovery, we should not call iput on that directory inode. Otherwise, we can loose some dentry blocks after the recovery. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: return proper error from start_gc_threadNamjae Jeon
when there is an error from kthread_run, then return proper error rather than returning -ENOMEM. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: optimize several routines in node.hNamjae Jeon
There are various functions with common code which could be separated out to make common routines. So, made new routines and in order to retain the same call path and no major changes, written some macros to access those routines. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: remove unneeded initializations in f2fs_parent_dirNamjae Jeon
There is no need to initialize few pointers in f2fs_parent_dir as the values are not checked and instead directly initialized values are used. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: push some variables to debug partNamjae Jeon
Some, counters are needed only for the statistical information while debugging. So, those can be controlled using CONFIG_F2FS_STAT_FS, pushing the usage for few variables under this flag. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: align data types between on-disk and in-memory block addressesJaegeuk Kim
The on-disk block address is defined as __le32, but in-memory block address, block_t, does as u64. Let's synchronize them to 32 bits. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: dereferencing an ERR_PTRDan Carpenter
There is an error path where "dir" is an ERR_PTR. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: use iholdJaegeuk Kim
Use the following helper function committed by Al. commit 7de9c6ee3ecffd99e1628e81a5ea5468f7581a1f Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Oct 23 11:11:40 2010 -0400 new helper: ihold() ... Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: should not make_bad_inode on f2fs_link failureJaegeuk Kim
If -ENOSPC is met during f2fs_link, we should not make the inode as bad. The inode is still alive. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix to handle do_recover_data errorsJaegeuk Kim
This patch adds error handling codes of check_index_in_prev_nodes and its caller, do_recover_data. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: reuse the locked dnode page and its inodeJaegeuk Kim
This patch fixes the following deadlock bug during the recovery. INFO: task mount:1322 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mount D ffffffff81125870 0 1322 1266 0x00000000 ffff8801207e39d8 0000000000000046 ffff88012ab1dee0 0000000000000046 ffff8801207e3a08 ffff880115903f40 ffff8801207e3fd8 ffff8801207e3fd8 ffff8801207e3fd8 ffff880115903f40 ffff8801207e39d8 ffff88012fc94520 Call Trace: [<ffffffff81125870>] ? __lock_page+0x70/0x70 [<ffffffff816a92d9>] schedule+0x29/0x70 [<ffffffff816a93af>] io_schedule+0x8f/0xd0 [<ffffffff8112587e>] sleep_on_page+0xe/0x20 [<ffffffff816a649a>] __wait_on_bit_lock+0x5a/0xc0 [<ffffffff81125867>] __lock_page+0x67/0x70 [<ffffffff8106c7b0>] ? autoremove_wake_function+0x40/0x40 [<ffffffff81126857>] find_lock_page+0x67/0x80 [<ffffffff8112698f>] find_or_create_page+0x3f/0xb0 [<ffffffffa03901a8>] ? sync_inode_page+0xa8/0xd0 [f2fs] [<ffffffffa038fdf7>] get_node_page+0x67/0x180 [f2fs] [<ffffffffa039818b>] recover_fsync_data+0xacb/0xff0 [f2fs] [<ffffffff816aaa1e>] ? _raw_spin_unlock+0x3e/0x40 [<ffffffffa0389634>] f2fs_fill_super+0x7d4/0x850 [f2fs] [<ffffffff81184cf9>] mount_bdev+0x1c9/0x210 [<ffffffffa0388e60>] ? validate_superblock+0x180/0x180 [f2fs] [<ffffffffa0387635>] f2fs_mount+0x15/0x20 [f2fs] [<ffffffff81185a13>] mount_fs+0x43/0x1b0 [<ffffffff81145ba0>] ? __alloc_percpu+0x10/0x20 [<ffffffff811a0796>] vfs_kern_mount+0x76/0x120 [<ffffffff811a2cb7>] do_mount+0x237/0xa10 [<ffffffff81140b9b>] ? strndup_user+0x5b/0x80 [<ffffffff811a3520>] SyS_mount+0x90/0xe0 [<ffffffff816b3502>] system_call_fastpath+0x16/0x1b The bug is triggered when check_index_in_prev_nodes tries to get the direct node page by calling get_node_page. At this point, if the direct node page is already locked by get_dnode_of_data, its caller, we got a deadlock condition. This patch adds additional condition check for the reuse of locked direct node pages prior to the get_node_page call. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix wrong condition checkJaegeuk Kim
While an orphan inode has zero link_count, f2fs_gc is able to select the inode for foreground gc. - f2fs_gc - do_garbage_collect - gc_data_segment : f2fs_iget is failed : get_valid_blocks() != 0, so that retry --> here we got the infinite loop. This patch resolved this issue. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: add f2fs_readonly()Jaegeuk Kim
Introduce a simple macro function for readability. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: avoid RECLAIM_FS-ON-W: deadlockJaegeuk Kim
This patch tries to avoid the following deadlock condition of which the reclaim path can trigger f2fs_balance_fs again. ================================= [ INFO: inconsistent lock state ] --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes: (&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs] {RECLAIM_FS-ON-W} state was registered at: [<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140 [<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0 [<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0 [<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180 [<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0 [<ffffffff8113225c>] find_or_create_page+0x4c/0xb0 [<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs] [<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs] [<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs] [<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs] [<ffffffff811ae51b>] notify_change+0x1db/0x3a0 [<ffffffff8118fbd0>] do_truncate+0x60/0xa0 [<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0 [<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0 [<ffffffff8118ffee>] SyS_truncate+0xe/0x10 [<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: don't do checkpoint if error is occurredJaegeuk Kim
If we met an error during the dentry recovery, we should not conduct checkpoint. Otherwise, some errorneous dentry blocks overwrites the existing blocks that contain the remaining recovery information. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix to unlock page before exitJaegeuk Kim
If we got an error after lock_page, we should unlock it before exit. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: remove unnecessary kmap/kunmap operationsJaegeuk Kim
The allocated page used by the recovery is not on HIGHMEM, so that we don't need to use kmap/kunmap. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: reorganize f2fs_vm_page_mkwriteNamjae Jeon
Few things can be changed in the default mkwrite function 1) Make file_update_time at the start before acquiring any lock 2) the condition page_offset(page) >= i_size_read(inode) should be changed to page_offset(page) > i_size_read 3) Move wait_on_page_writeback. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: use list_for_each_entry rather than list_for_each_entry_safemajianpeng
We can do this, since now we use a global mutex, f2fs_stat_mutex to protect its list operations. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> [Jaegeuk Kim: add description] Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: remove unecessary variable and codeHaicheng Li
Code cleanup without behavior changed. Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs, lockdep: annotate mutex_lock_all()Peter Zijlstra
Majianpeng reported a lockdep splat for f2fs. It turns out mutex_lock_all() acquires an array of locks (in global/local lock style). Any such operation is always serialized using cp_mutex, therefore there is no fs_lock[] lock-order issue; tell lockdep about this using the mutex_lock_nest_lock() primitive. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: add debug msgs in the recovery routineJaegeuk Kim
This patch adds some trivial debugging messages in the recovery process. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: update inode page after creationJaegeuk Kim
I found a bug when testing power-off-recovery as follows. [Bug Scenario] 1. create a file 2. fsync the file 3. reboot w/o any sync 4. try to recover the file - found its fsync mark - found its dentry mark : try to recover its dentry - get its file name - get its parent inode number : here we got zero value The reason why we get the wrong parent inode number is that we didn't synchronize the inode page with its newly created inode information perfectly. Especially, previous f2fs stores fi->i_pino and writes it to the cached node page in a wrong order, which incurs the zero-valued i_pino during the recovery. So, this patch modifies the creation flow to fix the synchronization order of inode page with its inode. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: change get_new_data_page to pass a locked node pageJaegeuk Kim
This patch is for passing a locked node page to get_dnode_of_data. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: skip get_node_page if locked node page is passedJaegeuk Kim
If get_dnode_of_data gets a locked node page, let's skip redundant get_node_page calls. This is for the futher enhancement. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: remove unnecessary por_doing checkJaegeuk Kim
This por_doing check is totally not related to the recovery process. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix BUG_ON during f2fs_evict_inode(dir)Jaegeuk Kim
During the dentry recovery routine, recover_inode() triggers __f2fs_add_link with its directory inode. In the following scenario, a bug is captured. 1. dir = f2fs_iget(pino) 2. __f2fs_add_link(dir, name) 3. iput(dir) -> f2fs_evict_inode() faces with BUG_ON(atomic_read(fi->dirty_dents)) Kernel BUG at ffffffffa01c0676 [verbose debug info unavailable] [<ffffffffa01c0676>] f2fs_evict_inode+0x276/0x300 [f2fs] Call Trace: [<ffffffff8118ea00>] evict+0xb0/0x1b0 [<ffffffff8118f1c5>] iput+0x105/0x190 [<ffffffffa01d2dac>] recover_fsync_data+0x3bc/0x1070 [f2fs] [<ffffffff81692e8a>] ? io_schedule+0xaa/0xd0 [<ffffffff81690acb>] ? __wait_on_bit_lock+0x7b/0xc0 [<ffffffff8111a0e7>] ? __lock_page+0x67/0x70 [<ffffffff81165e21>] ? kmem_cache_alloc+0x31/0x140 [<ffffffff8118a502>] ? __d_instantiate+0x92/0xf0 [<ffffffff812a949b>] ? security_d_instantiate+0x1b/0x30 [<ffffffff8118a5b4>] ? d_instantiate+0x54/0x70 This means that we should flush all the dentry pages between iget and iput(). But, during the recovery routine, it is unallowed due to consistency, so we have to wait the whole recovery process. And then, write_checkpoint flushes all the dirty dentry blocks, and nicely we can put the stale dir inodes from the dirty_dir_inode_list. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix por_doing variable coverageJaegeuk Kim
The reason of using sbi->por_doing is to alleviate data writes during the recovery. The find_fsync_dnodes() produces some dirty dentry pages, so we should cover it too with sbi->por_doing. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: remove redundant assignmentJaegeuk Kim
We don't need to assign a value redundantly. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix the inconsistent state of data pagesJaegeuk Kim
In get_lock_data_page, if there is a data race between get_dnode_of_data for node and grab_cache_page for data, f2fs is able to face with the following BUG_ON(dn.data_blkaddr == NEW_ADDR). kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/data.c:251! [<ffffffffa044966c>] get_lock_data_page+0x1ec/0x210 [f2fs] Call Trace: [<ffffffffa043b089>] f2fs_readdir+0x89/0x210 [f2fs] [<ffffffff811a0920>] ? fillonedir+0x100/0x100 [<ffffffff811a0920>] ? fillonedir+0x100/0x100 [<ffffffff811a07f8>] vfs_readdir+0xb8/0xe0 [<ffffffff811a0b4f>] sys_getdents+0x8f/0x110 [<ffffffff816d7999>] system_call_fastpath+0x16/0x1b This bug is able to be occurred when the block address of the data block is changed after f2fs_put_dnode(). In order to avoid that, this patch fixes the lock order of node and data blocks in which the node block lock is covered by the data block lock. Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28f2fs: fix inconsistency of block count during recoveryJaegeuk Kim
Currently f2fs recovers the dentry of fsynced files. When power-off-recovery is conducted, this newly recovered inode should increase node block count as well as inode block count. This patch resolves this inconsistency that results in: 1. create a file 2. write data 3. fsync 4. reboot without sync 5. mount and recover the file 6. node block count is 1 and inode block count is 2 : fall into the inconsistent state 7. unlink the file : trigger the following BUG_ON ------------[ cut here ]------------ kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/f2fs.h:716! Call Trace: [<ffffffffa0344100>] ? get_node_page+0x50/0x1a0 [f2fs] [<ffffffffa0344bfc>] remove_inode_page+0x8c/0x100 [f2fs] [<ffffffffa03380f0>] ? f2fs_evict_inode+0x180/0x2d0 [f2fs] [<ffffffffa033812e>] f2fs_evict_inode+0x1be/0x2d0 [f2fs] [<ffffffff811c7a67>] evict+0xa7/0x1a0 [<ffffffff811c82b5>] iput+0x105/0x190 [<ffffffff811c2b30>] d_kill+0xe0/0x120 [<ffffffff811c2c57>] dput+0xe7/0x1e0 [<ffffffff811acc3d>] __fput+0x19d/0x2d0 [<ffffffff811acd7e>] ____fput+0xe/0x10 [<ffffffff81070645>] task_work_run+0xb5/0xe0 [<ffffffff81002941>] do_notify_resume+0x71/0xb0 [<ffffffff8175f14a>] int_signal+0x12/0x17 Reported-and-Tested-by: Chris Fries <C.Fries@motorola.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-27ext4: make punch hole code path work with bigallocLukas Czerner
Currently punch hole is disabled in file systems with bigalloc feature enabled. However the recent changes in punch hole patch should make it easier to support punching holes on bigalloc enabled file systems. This commit changes partial_cluster handling in ext4_remove_blocks(), ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently partial_cluster is unsigned long long type and it makes sure that we will free the partial cluster if all extents has been released from that cluster. However it has been specifically designed only for truncate. With punch hole we can be freeing just some extents in the cluster leaving the rest untouched. So we have to make sure that we will notice cluster which still has some extents. To do this I've changed partial_cluster to be signed long long type. The only scenario where this could be a problem is when cluster_size == block size, however in that case there would not be any partial clusters so we're safe. For bigger clusters the signed type is enough. Now we use the negative value in partial_cluster to mark such cluster used, hence we know that we must not free it even if all other extents has been freed from such cluster. This scenario can be described in simple diagram: |FFF...FF..FF.UUU| ^----------^ punch hole . - free space | - cluster boundary F - freed extent U - used extent Also update respective tracepoints to use signed long long type for partial_cluster. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: update ext4_ext_remove_space trace pointLukas Czerner
Add "end" variable. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: remove unused code from ext4_remove_blocks()Lukas Czerner
The "head removal" branch in the condition is never used in any code path in ext4 since the function only caller ext4_ext_rm_leaf() will make sure that the extent is properly split before removing blocks. Note that there is a bug in this branch anyway. This commit removes the unused code completely and makes use of ext4_error() instead of printk if dubious range is provided. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: remove unused discard_partial_page_buffersLukas Czerner
The discard_partial_page_buffers is no longer used anywhere so we can simply remove it including the *_no_lock variant and EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: use ext4_zero_partial_blocks in punch_holeLukas Czerner
We're doing to get rid of ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Now when the truncate_inode_pages_range() can handle truncating non page aligned regions we can use this to invalidate and zero out block aligned region of the punched out range and then use ext4_block_truncate_page() to zero the unaligned blocks on the start and end of the range. This will greatly simplify the punch hole code. Moreover after this commit we can get rid of the ext4_discard_partial_page_buffers() completely. We also introduce function ext4_prepare_punch_hole() to do come common operations before we attempt to do the actual punch hole on indirect or extent file which saves us some code duplication. This has been tested on ppc64 with 1k block size with fsx and xfstests without any problems. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: truncate_inode_pages() in orphan cleanup pathLukas Czerner
Currently we do not tell mm to zero out tail of the page before truncate in orphan_cleanup(). This is ok, because the page should not be uptodate, however this may eventually change and I might cause problems. Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara for pointing this out. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27Revert "ext4: fix fsx truncate failure"Lukas Czerner
This reverts commit 189e868fa8fdca702eb9db9d8afc46b5cb9144c9. This commit reintroduces the use of ext4_block_truncate_page() in ext4 truncate operation instead of ext4_discard_partial_page_buffers(). The statement in the commit description that the truncate operation only zero block unaligned portion of the last page is not exactly right, since truncate_pagecache_range() also zeroes and invalidate the unaligned portion of the page. Then there is no need to zero and unmap it once more and ext4_block_truncate_page() was doing the right job, although we still need to update the buffer head containing the last block, which is exactly what ext4_block_truncate_page() is doing. Moreover the problem described in the commit is fixed more properly with commit 15291164b22a357cb211b618adfef4fa82fc0de3 jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer This was tested on ppc64 machine with block size of 1024 bytes without any problems. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: Call ext4_jbd2_file_inode() after zeroing blockLukas Czerner
In data=ordered mode we should call ext4_jbd2_file_inode() so that crash after the truncate transaction has committed does not expose stall data in the tail of the block. Thanks Jan Kara for pointing that out. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27Revert "ext4: remove no longer used functions in inode.c"Lukas Czerner
This reverts commit ccb4d7af914e0fe9b2f1022f8ea6c300463fd5e6. This commit reintroduces functions ext4_block_truncate_page() and ext4_block_zero_page_range() which has been previously removed in favour of ext4_discard_partial_page_buffers(). In future commits we want to reintroduce those function and remove ext4_discard_partial_page_buffers() since it is duplicating some code and also partially duplicating work of truncate_pagecache_range(), moreover the old implementation was much clearer. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27Merge 3.10-rc3 into driver-core-nextGreg Kroah-Hartman
We want the changes here. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-26Merge tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Stable fix to prevent an rpc_task wakeup race - Fix a NFSv4.1 session drain deadlock - Fix a NFSv4/v4.1 mount regression when not running rpc.gssd - Ensure auth_gss pipe detection works in namespaces - Fix SETCLIENTID fallback if rpcsec_gss is not available * tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix SETCLIENTID fallback if GSS is not available SUNRPC: Prevent an rpc_task wakeup race NFSv4.1 Fix a pNFS session draining deadlock SUNRPC: Convert auth_gss pipe detection to work in namespaces SUNRPC: Faster detection if gssd is actually running SUNRPC: Fix a bug in gss_create_upcall
2013-05-26Merge tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs fixes from Ben Myers: "Here are fixes for corruption on 512 byte filesystems, a rounding error, a use-after-free, some flags to fix lockdep reports, and several fixes related to CRCs. We have a somewhat larger post -rc1 queue than usual due to fixes related to the CRC feature we merged for 3.10: - Fix for corruption with FSX on 512 byte blocksize filesystems - Fix rounding error in xfs_free_file_space - Fix use-after-free with extent free intents - Add several missing KM_NOFS flags to fix lockdep reports - Several fixes for CRC related code" * tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs: xfs: remote attribute lookups require the value length xfs: xfs_attr_shortform_allfit() does not handle attr3 format. xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC xfs: fix missing KM_NOFS tags to keep lockdep happy xfs: Don't reference the EFI after it is freed xfs: fix rounding in xfs_free_file_space xfs: fix sub-page blocksize data integrity writes
2013-05-24aio: fix kioctx not being freed after cancellation at exit timeBenjamin LaHaise
The recent changes overhauling fs/aio.c introduced a bug that results in the kioctx not being freed when outstanding kiocbs are cancelled at exit_aio() time. Specifically, a kiocb that is cancelled has its completion events discarded by batch_complete_aio(), which then fails to wake up the process stuck in free_ioctx(). Fix this by modifying the wait_event() condition in free_ioctx() appropriately. This patch was tested with the cancel operation in the thread based code posted yesterday. [akpm@linux-foundation.org: fix build] Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24ocfs2: goto out_unlock if ocfs2_get_clusters_nocache() failed in ocfs2_fiemap()Joseph Qi
Last time we found there is lock/unlock bug in ocfs2_file_aio_write, and then we did a thorough search for all lock resources in ocfs2_inode_info, including rw, inode and open lockres and found this bug. My kernel version is 3.0.13, and it is also in the lastest version 3.9. In ocfs2_fiemap, once ocfs2_get_clusters_nocache failed, it should goto out_unlock instead of out, because we need release buffer head, up read alloc sem and unlock inode. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundaryRyusuke Konishi
nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary DESCRIPTION: There are use-cases when NILFS2 file system (formatted with block size lesser than 4 KB) can be remounted in RO mode because of encountering of "broken bmap" issue. The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>: "The machine I've been trialling nilfs on is running Debian Testing, Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've also reproduced it (identically) with Debian Unstable amd64 and Debian Experimental (using the 3.8-trunk kernel). The problematic partitions were formatted with "mkfs.nilfs2 -b 1024 -B 8192"." SYMPTOMS: (1) System log contains error messages likewise: [63102.496756] nilfs_direct_assign: invalid pointer: 0 [63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28) [63102.496798] [63102.524403] Remounting filesystem read-only (2) The NILFS2 file system is remounted in RO mode. REPRODUSING PATH: (1) Create volume group with name "unencrypted" by means of vgcreate utility. (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>): ----------------[BEGIN SCRIPT]-------------------- VG=unencrypted lvcreate --size 2G --name ntest $VG mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest mkdir /var/tmp/n mkdir /var/tmp/n/ntest mount /dev/mapper/$VG-ntest /var/tmp/n/ntest mkdir /var/tmp/n/ntest/thedir cd /var/tmp/n/ntest/thedir sleep 2 date darcs init sleep 2 dmesg|tail -n 5 date darcs whatsnew || true date sleep 2 dmesg|tail -n 5 ----------------[END SCRIPT]-------------------- REPRODUCIBILITY: 100% INVESTIGATION: As it was discovered, the issue takes place during segment construction after executing such sequence of user-space operations: open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7 fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 ftruncate(7, 60) The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)" takes place because of trying to get block number for third block of the file with logical offset #3072 bytes. As it is possible to see from above output, the file has 60 bytes of the whole size. So, it is enough one block (1 KB in size) allocation for the whole file. Trying to operate with several blocks instead of one takes place because of discovering several dirty buffers for this file in nilfs_segctor_scan_file() method. The root cause of this issue is in nilfs_set_page_dirty function which is called just before writing to an mmapped page. When nilfs_page_mkwrite function handles a page at EOF boundary, it fills hole blocks only inside EOF through __block_page_mkwrite(). The __block_page_mkwrite() function calls set_page_dirty() after filling hole blocks, thus nilfs_set_page_dirty function (= a_ops->set_page_dirty) is called. However, the current implementation of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page at EOF boundary. As a result, buffers outside EOF are inconsistently marked dirty and queued for write even though they are not mapped with nilfs_get_block function. FIX: This modifies nilfs_set_page_dirty() not to mark hole blocks dirty. Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals for this issue. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk> Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24aio: fix io_getevents documentationJeff Moyer
In reviewing man pages, I noticed that io_getevents is documented to update the timeout that gets passed into the library call. This doesn't happen in kernel space or in the library (even though it's documented to do so in both places). Unless there is objection, I'd like to fix the comments/docs to match the code (I will also update the man page upon consensus). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Cyril Hrubis <chrubis@suse.cz> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>