Age | Commit message (Collapse) | Author |
|
Commit 0227d6abb378 ("fs/cachefiles: replace kerror by pr_err") didn't
include newline featuring in original kerror definition
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org> [3.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This
bug was introduced in commit 68b5a6524856 ("mm: softdirty: respect
VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at
VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
returned by find_vma.
Tested:
Wrote a selftest that creates a PMD-sized VMA then unmaps the first
page and asserts that the page is not softdirty. I'm going to send the
pagemap selftest in a later commit.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jamie Liu <jamieliu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a deadlock case which reported by Guozhonghua:
https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html
This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.
It was introduced by commit 8d400b81cc83 ("ocfs2/dlm: Clean up refmap
helpers"). Since lockres is new, it doesn't not require the
&res->spinlock. So remove it.
Fixes: 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system. It is easily
reproducible with the following script:
----------------[BEGIN SCRIPT]--------------------
mkfs.nilfs2 -f /dev/sdb
mount /dev/sdb /mnt
dd if=/dev/zero bs=1M count=30 of=/mnt/testfile
umount /mnt
mount /dev/sdb /mnt
CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"
/root/mmaptest/mmaptest /mnt/testfile 30 10 5
sync
CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
umount /mnt
mount /dev/sdb /mnt
CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
umount /mnt
echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
echo "AFTER MMAP:\t$CHECKSUM_AFTER"
echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
----------------[END SCRIPT]--------------------
The mmaptest tool looks something like this (very simplified, with
error checking removed):
----------------[BEGIN mmaptest]--------------------
data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, file_offset);
for (i = 0; i < write_count; ++i) {
memcpy(data + i * 4096, buf, sizeof(buf));
msync(data, file_size - file_offset, MS_SYNC))
}
----------------[END mmaptest]--------------------
The output of the script looks something like this:
BEFORE MMAP: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile
AFTER MMAP: 6604a1c31f10780331a6850371b3a313 /mnt/testfile
AFTER REMOUNT: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile
So it is clear, that the changes done using mmap() do not survive a
remount. This can be reproduced a 100% of the time. The problem was
introduced in commit 136e8770cd5d ("nilfs2: fix issue of
nilfs_set_page_dirty() for page at EOF boundary").
If the page was read with mpage_readpage() or mpage_readpages() for
example, then it has no buffers attached to it. In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.
This patch fixes the problem by also calling nilfs_set_file_dirty() if the
page has no buffers attached to it.
[akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
osb->vol_label is malloced in ocfs2_initialize_super but not freed if
error occurs or during umount, thus causing a memory leak.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I found the dump messages of UBIFS_SB_NODE is not aligned. This
patch remove the extra space from the line which is retracted.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
|
|
When fabricating a server index key for fscache, we should clear the index key
buffer before starting to fill it in, not in the middle.
Reported-by: James Pearson <james-p@moving-picture.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Silence a few warnings about missing symbols that are due to missing
includes of nfs3_fs.h.
Fixes: 00a36a1090350 (NFS: Move v3 declarations out of internal.h)
Fixes: cb8c20fa53ec2 (NFS: Move NFS v3 acl functions to nfs3_fs.h)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
- it doesn't hurt for kswapd to block occasionally. If it doesn't
want to block it would clear __GFP_WAIT. The current_is_kswapd()
was only added to avoid deadlocks and we have a new approach for
that.
- memory allocation in the SUNRPC layer can very rarely try to
->releasepage() a page it is trying to handle. The deadlock
is removed as nfs_release_page() doesn't block indefinitely.
So we don't need to set PF_FSTRANS for sunrpc network operations any
more.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive. I have
seen delays of as much as 208 seconds.
To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished. Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().
With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds. With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.
If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.
nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().
This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second. If the commit doesn't complete
in this time, nfs_release_page() will fail. This means it might now
fail in some cases where it wouldn't before. These cases are only
when 'gfp' includes '__GFP_WAIT'.
nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
- page_cache_pipe_buf_steal() in splice.c
- shrink_page_list() in vmscan.c
- invalidate_inode_pages2_range() in truncate.c
The first two handle failure quite safely. The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.
So aborting if the commit takes longer than 1 second is perfectly safe.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
commit b31268ac793fd300da66b9c28bbf0a200339ab96
FS: Use stable writes when not doing a bulk flush
was a bit heavy handed.
The particular problem that lead to this patch was that
small writes to an O_SYNC file we being written as UNSTABLE writes
followed by a commit.
This is appropriate for large writes (which require multiple NFS
requests) but for small writes (single NFS request), using
NFS_FILE_SYNC is more efficient.
So that patch causes the code to select between the two methods
depending on how many nfs requests get generated.
Unfortunately this ends up applying to non O_SYNC writes as well.
In particular if you memory-map a file and update random pages, then
when they are eventually written out by writeback they will go as
NFS_FILE_SYNC. This is inefficient and slows down the application.
So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
With this patch:
O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
followed by COMMIT for multiple requests
Writing immediately before close of fsync follow the same pattern.
Non-O_SYNC writes without an fsync of close eventually get flushed
out as UNSTABLE and a commit follows eventually as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Currently asynchronous NFSv4 request will be retried with
exponential timeout (from 1/10 to 15 seconds), but async
requests will always use a 15second retry.
Some "async" requests are really synchronous though. The
async mechanism is used to allow the request to continue if
the requesting process is killed.
In those cases, an exponential retry is appropriate.
For example, if two different clients both open a file and
get a READ delegation, and one client then unlinks the file
(while still holding an open file descriptor), that unlink
will used the "silly-rename" handling which is async.
The first rename will result in NFS4ERR_DELAY while the
delegation is reclaimed from the other client. The rename
will not be retried for 15 seconds, causing an unlink to take
15 seconds rather than 100msec.
This patch only added exponential timeout for async unlink and
async rename. Other async calls, such as 'close' are sometimes
waited for so they might benefit from exponential timeout too.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If rpc.statd is restarted, upcalls to monitor hosts can fail with
ECONNREFUSED. In that case force a lookup of statd's new port and retry the
upcall.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Commit c9fdeb28 removed a 'continue' after checking if the lease needs
to be renewed. However, if client hasn't moved, the code falls down to
starting reboot recovery erroneously (ie., sends open reclaim and gets
back stale_clientid error) before recovering from getting stale_clientid
on the renew operation.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: c9fdeb280b8c (NFS: Add basic migration support to state manager thread)
Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Commit 65b38851a174
("NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes")
updated the following function:
static int nfs_volume_list_open(struct inode *inode, struct file *file)
it used &nfs_server_list_ops instead of &nfs_volume_list_ops
which means cat /proc/fs/nfsfs/volumes = /proc/fs/nfsfs/servers
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Fixes: 65b38851a174 (NFS: Fix /proc/fs/nfsfs/servers and...)
Cc: stable@vger.kernel.org # 3.4.x+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.
Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.
This patch adds @flags to percpu_ref_init() and implements the
following flags.
* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained
These flags should be able to serve the above two use cases.
v2: target_core_tpg.c conversion was missing. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
|
|
Previously, f2fs activates SSR if the # of free segments reaches to the # of
overprovisioned segments.
In this case, SSR starts to use dirty segments only, so that the overprovisoned
space cannot be selected for new data.
This means that we have no chance to utilizae the overprovisioned space at all.
This patch fixes that by allowing LFS allocations until the # of free segments
reaches to the last threshold, reserved space.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch changes the ipu_policy setting to use any combination of orthogonal policies.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In ->get_victim we get max_search value from dirty_i->nr_dirty without
protection of seglist_lock, after that, nr_dirty can be increased/decreased
before we hold seglist_lock lock.
Then in main loop we attempt to traverse all dirty section one time to find
victim section, but it's not accurate to use max_search as the total loop count,
because we might lose checking several sections or check sections redundantly
for the case of nr_dirty are increased or decreased previously.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In manual of mount, we descript remount as below:
"mount -o remount,rw /dev/foo /dir
After this call all old mount options are replaced and arbitrary stuff from
fstab is ignored, except the loop= option which is internally generated and
maintained by the mount command."
Previously f2fs do not clear up old mount options when remount_fs, so we have no
chance of disabling previous option (e.g. flush_merge). Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Now punching hole in directory is not supported in f2fs, so let's limit file
type in punch_hole().
In addition, in punch_hole if offset is exceed file size, we should skip
punching hole.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes
sector device at maximum. But now f2fs only support 512 bytes size sector, so
block device such as zRAM which uses page cache as its block storage space will
not be mounted successfully as mismatch between sector size of zRAM and sector
size of f2fs supported.
In this patch we support large sector size in f2fs, so block device with sector
size of 512/1024/2048/4096 bytes can be supported in f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past
EOF without changing i_size of inode. These blocks past EOF will not be
truncated in ->setattr as we truncate them only when change the file size.
We should give a chance to truncate blocks out of filesize in setattr().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path,
we should update i_size at the changed time to update its inode page.
Otherwise, we can get wrong i_size after roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch cleans up a simple macro.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If same data is updated multiple times, we don't need to redo whole the
operations.
Let's just update the lastest one.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In f2fs_sync_file, if there is no written appended writes, it skips
to write its node blocks.
But, if there is up-to-date inode page, we should write it to update
its metadata during the roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We can summarize the roll forward recovery scenarios as follows.
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
-> Update the latest inode(x).
2. inode(x) | CP | inode(F) | dnode(F)
-> No problem.
3. inode(x) | CP | dnode(F) | inode(x)
-> Recover to the latest dnode(F), and drop the last inode(x)
4. inode(x) | CP | dnode(F) | inode(F)
-> No problem.
5. CP | inode(x) | dnode(F)
-> The inode(DF) was missing. Should drop this dnode(F).
6. CP | inode(DF) | dnode(F)
-> No problem.
7. CP | dnode(F) | inode(DF)
-> If f2fs_iget fails, then goto next to find inode(DF).
8. CP | dnode(F) | inode(x)
-> If f2fs_iget fails, then goto next to find inode(DF).
But it will fail due to no inode(DF).
So, this patch adds some missing points such as #1, #5, #7, and #8.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch revisited whole the recovery information during the f2fs_sync_file.
In this patch, there are three information to make a decision.
a) IS_CHECKPOINTED, /* is it checkpointed before? */
b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */
And, the scenarios for our rule are based on:
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)
For example, #3, the three conditions should be changed as follows.
inode(x) | CP | dnode(F) | inode(x) | inode(F)
a) x o o o o
b) x x x x o
c) x o o x o
If f2fs_sync_file stops ------^,
it should write inode(F) --------------^
So, the need_inode_block_update should return true, since
c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
For example, #8,
CP | alloc | dnode(F) | inode(x) | inode(DF)
a) o x x x x
b) x x x o
c) o o x o
If f2fs_sync_file stops -------^,
it should write inode(DF) --------------^
Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Previously, all the dnode pages should be read during the roll-forward recovery.
Even worsely, whole the chain was traversed twice.
This patch removes that redundant and costly read operations by using page cache
of meta_inode and readahead function as well.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
|
|
On a sub-page sized filesystem, truncating a mapped region down
leaves us in a world of hurt. We truncate the pagecache, zeroing the
newly unused tail, then punch blocks out from under the page. If we
then truncate the file back up immediately, we expose that unmapped
hole to a dirty page mapped into the user application, and that's
where it all goes wrong.
In truncating the page cache, we avoid unmapping the tail page of
the cache because it still contains valid data. The problem is that
it also contains a hole after the truncate, but nobody told the mm
subsystem that. Therefore, if the page is dirty before the truncate,
we'll never get a .page_mkwrite callout after we extend the file and
the application writes data into the hole on the page. Hence when
we come to writing that region of the page, it has no blocks and no
delayed allocation reservation and hence we toss the data away.
This patch adds code to the truncate up case to solve it, by
ensuring the partial page at the old EOF is always cleaned after we
do any zeroing and move the EOF upwards. We can't actually serialise
the page writeback and truncate against page faults (yes, that
problem AGAIN) so this is really just a best effort and assumes it
is extremely unlikely that someone is concurrently writing to the
page at the EOF while extending the file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Fix sparse warning introduced by commit 4ef897a ("xfs: flush both
inodes in xfs_swap_extents").
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Fix sparse warning introduced by commit ac8809f9 ("xfs: abort
metadata writeback on permanent errors").
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Fix sparse warning introduced by commit afabfd3 ("xfs: combine
xfs_rtmodify_summary and xfs_rtget_summary").
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_quota.h was included twice.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs
if it's invalid:
ASSERT(type < XFS_DIR3_FT_MAX);
We shouldn't ASSERT on bad values read from disk. V3 dirs are
CRC-protected, but V2 dirs + ftype are not.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When running a tight mount/unmount loop on an older kernel, RedHat
QE found that unmount would occasionally hang in
xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
debug work by Eric Sandeen indicated that it was hanging on the
writing of the superblock during unmount immediately after logging
the superblock counters in a synchronous transaction. Further debug
indicated that the synchronous transaction was not waiting for
completion correctly, and we narrowed it down to
xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
the transaction in the iclog buffer to disk correctly.
While this unmount superblock write code is now very different in
mainline kernels, the xlog_cil_force_lsn() code is identical, and it
was bisected to the backport of commit f876e44 ("xfs: always do log
forces via the workqueue"). This commit made the CIL push
asynchronous for log forces and hence exposed a race condition that
couldn't occur on a synchronous push.
Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
that the sequence push would be complete by the time
xlog_cil_push_now() returned, resulting in the context being pushed
being in the committing list. When it was made asynchronous, it was
recognised that there was a race condition in detecting whether an
asynchronous push has started or not and code was added to handle
it.
Unfortunately, the fix was not quite right and left a race condition
where it it would detect an empty CIL while a push was in progress
before the context had been added to the committing list. This was
incorrectly seen as a "nothing to do" condition and so would tell
xfs_log_force_lsn() that there is nothing to wait for, and hence it
would push the iclogbufs in memory.
The fix is simple, but explaining the logic and the race condition
is a lot more complex. The fix is to add the context to the
committing list before we start emptying the CIL. This allows us to
detect the difference between an empty "do nothing" push and a push
that has not started by adding a discrete "emptying the CIL" state
to avoid the transient, incorrect "empty" condition that the
(unchanged) waiting code was seeing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes and truncates the page cache from
the start offset of the free to EOF.
Modify xfs_free_file_space() to write back and truncate page cache of
just the range being freed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based it can sustain changes in the
extent list during the shift sequence without disruption.
Modify xfs_collapse_file_space() to writeback and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space free
need only affect the region of the file that is freed and this could
change in the future.
Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as a best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyways, we just run it first to ensure the file remains in a
consistent state throughout the collapse.
Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents and the caller is expected to prevent this scenario in advance
as is done by collapse.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates index decrement on merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.
This is a code refactor only. The behavior of extent shift and collapse
range is not modified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.
Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking is adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.
This is a code refactor only. The behavior of extent shift and collapse
range is not modified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail mid
operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.
To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of ths function can remain based on
extent index as ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.
The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.
Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.
This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.
cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.18 RCU changes from Paul E. McKenney:
"
* Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2014/8/28/378.
* Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2014/8/28/386. An additional fix that
eliminates a documented (but now inconvenient) deadlock between
RCU hotplug and expedited grace periods was posted at
https://lkml.org/lkml/2014/8/28/573.
* Changes related to No-CBs CPUs and NO_HZ_FULL. These were posted
to LKML at https://lkml.org/lkml/2014/8/28/412.
* Torture-test updates. These were posted to LKML at
https://lkml.org/lkml/2014/8/28/546 and at
https://lkml.org/lkml/2014/9/11/1114.
* RCU-tasks implementation. These were posted to LKML at
https://lkml.org/lkml/2014/8/28/540.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fs-cache fixes from David Howells:
- Put a timeout in releasepage() to deal with a recursive hang between
the memory allocator, writeback, ext4 and fscache under memory
pressure.
- Fix a pair of refcount bugs in the fscache error handling.
- Remove a couple of unused pagevecs.
- The cachefiles requirement that the base directory support rename
should permit rename2 as an alternative - otherwise certain
filesystems cannot now be used as backing stores (such as ext4).
* tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
CacheFiles: Handle rename2
cachefiles: remove two unused pagevecs.
FS-Cache: refcount becomes corrupt under vma pressure.
FS-Cache: Reduce cookie ref count if submit fails.
FS-Cache: Timeout for releasepage()
|