Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
"The largest set of changes here come from Miao Xie. He's cleaning up
and improving read recovery/repair for raid, and has a number of
related fixes.
I've merged another set of fsync fixes from Filipe, and he's also
improved the way we handle metadata write errors to make sure we force
the FS readonly if things go wrong.
Otherwise we have a collection of fixes and cleanups. Dave Sterba
gets a cookie for removing the most lines (thanks Dave)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
btrfs: Fix compile error when CONFIG_SECURITY is not set.
Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
btrfs: Make btrfs handle security mount options internally to avoid losing security label.
Btrfs: send, don't delay dir move if there's a new parent inode
btrfs: add more superblock checks
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: be aware of btree inode write errors to avoid fs corruption
Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
btrfs: fix shadow warning on cmp
Btrfs: fix compilation errors under DEBUG
Btrfs: fix crash of btrfs_release_extent_buffer_page
Btrfs: add missing end_page_writeback on submit_extent_page failure
btrfs: Fix the wrong condition judgment about subset extent map
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: cleanup error handling in build_backref_tree
btrfs: move checks for DUMMY_ROOT into a helper
btrfs: new define for the inline extent data start
btrfs: kill extent_buffer_page helper
btrfs: drop constant param from btrfs_release_extent_buffer_page
btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe0c
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
|
|
Fix the following compile error when CONFIG_SECURITY is not set:
error: 'struct security_mnt_opts' has no member named 'num_mnt_opts'
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull "trivial tree" updates from Jiri Kosina:
"Usual pile from trivial tree everyone is so eagerly waiting for"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
Remove MN10300_PROC_MN2WS0038
mei: fix comments
treewide: Fix typos in Kconfig
kprobes: update jprobe_example.c for do_fork() change
Documentation: change "&" to "and" in Documentation/applying-patches.txt
Documentation: remove obsolete pcmcia-cs from Changes
Documentation: update links in Changes
Documentation: Docbook: Fix generated DocBook/kernel-api.xml
score: Remove GENERIC_HAS_IOMAP
gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
tty: doc: Fix grammar in serial/tty
dma-debug: modify check_for_stack output
treewide: fix errors in printk
genirq: fix reference in devm_request_threaded_irq comment
treewide: fix synchronize_rcu() in comments
checkstack.pl: port to AArch64
doc: queue-sysfs: minor fixes
init/do_mounts: better syntax description
MIPS: fix comment spelling
powerpc/simpleboot: fix comment
...
|
|
Commit fccb84c94 moved added some helpers to cleanup our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.
This fixes things to work when they are turned off too.
Signed-off-by: Chris Mason <clm@fb.com>
|
|
security label.
[BUG]
Originally when mount btrfs with "-o subvol=" mount option, btrfs will
lose all security lable.
And if the btrfs fs is mounted somewhere else, due to the lost of
security lable, SELinux will refuse to mount since the same super block
is being mounted using different security lable.
[REPRODUCER]
With SELinux enabled:
#mkfs -t btrfs /dev/sda5
#mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
#btrfs subvolume create /mnt/btrfs/subvol
#mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
/mnt/test
kernel message:
SELinux: mount invalid. Same superblock, different security settings
for (dev sda5, type btrfs)
[REASON]
This happens because btrfs will call vfs_kern_mount() and then
mount_subtree() to handle subvolume name lookup.
First mount will cut off all the security lables and when it comes to
the second vfs_kern_mount(), it has no security label now.
[FIX]
This patch will makes btrfs behavior much more like nfs,
which has the type flag FS_BINARY_MOUNTDATA,
making btrfs handles the security label internally.
So security label will be set in the real mount time and won't lose
label when use with "subvol=" mount option.
Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>
Conflicts:
fs/btrfs/extent_io.c
|
|
If between two snapshots we rename an existing directory named X to Y and
make it a child (direct or not) of a new inode named X, we were delaying
the move/rename of the former directory unnecessarily, which would result
in attempting to rename the new directory from its orphan name to name X
prematurely.
Minimal reproducer:
$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ mkdir -p /mnt/merlin/RC/OSD/Source
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ mkdir /mnt/OSD
$ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
$ mv /mnt/OSD /mnt/merlin/RC
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
$ btrfs send /mnt/mysnap1 -f /tmp/1.snap
$ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
$ mkfs.btrfs -f /dev/vdc
$ mount /dev/vdc /mnt2
$ btrfs receive /mnt2 -f /tmp/1.snap
$ btrfs receive /mnt2 -f /tmp/2.snap
The second receive (from an incremental send) failed with the following
error message: "rename o261-7-0 -> merlin/RC/OSD failed".
This is a regression introduced in the 3.16 kernel.
A test case for xfstests follows.
Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Populate btrfs_check_super_valid() with checks that try to verify
consistency of superblock by additional conditions that may arise from
corrupted devices or bitflips. Some of tests are only hints and issue
warnings instead of failing the mount, basically when the checks are
derived from the data found in the superblock.
Tested on a broken image provided by Qu.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions. If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller. This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).
Fix by rechecking whether the provided transid <= last_trans_committed
after the search fails, and if so return 0.
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
and keep prototype in qgroup.h only
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
cmp was declared twice in btrfs_compare_trees resulting in a shadow
warning. This patch renames second internal variable.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
bi_sector and bi_size moved to bi_iter since commit 4f024f3797c4
("block: Abstract out bvec iterator")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
This is actually inspired by Filipe's patch. When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
is using wrong condition to judgement whether the range is a subset of a
existing extent map.
This may cause bug in btrfs no-holes mode.
This patch will correct the judgment and fix the bug.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Marc Merlin sent me a broken fs image months ago where it would blow up in the
upper->checked BUG_ON() in build_backref_tree. This is because we had a
scenario like this
block a -- level 4 (not shared)
|
block b -- level 3 (reloc block, shared)
|
block c -- level 2 (not shared)
|
block d -- level 1 (shared)
|
block e -- level 0 (shared)
We go to build a backref tree for block e, we notice block d is shared and add
it to the list of blocks to lookup it's backrefs for. Now when we loop around
we will check edges for the block, so we will see we looked up block c last
time. So we lookup block d and then see that the block that points to it is
block c and we can just skip that edge since we've already been up this path.
The problem is because we clear need_check when we see block d (as it is shared)
we never add block b as needing to be checked. And because block c is in our
path already we bail out before we walk up to block b and add it to the backref
check list.
To fix this we need to reset need_check if we trip over a block that doesn't
need to be checked. This will make sure that any subsequent blocks in the path
as we're walking up afterwards are added to the list to be processed. With this
patch I can now mount Marc's fs image and it'll complete the balance without
panicing. Thanks,
Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When balance panics it tends to panic in the
BUG_ON(!upper->checked);
test, because it means it couldn't build the backref tree properly. This is
annoying to users and frankly a recoverable error, nothing in this function is
actually fatal since it is just an in-memory building of the backrefs for a
given bytenr. So go through and change all the BUG_ON()'s to ASSERT()'s, and
fix the BUG_ON(!upper->checked) thing to just return an error.
This patch also fixes the error handling so it tears down the work we've done
properly. This code was horribly broken since we always just panic'ed instead
of actually erroring out, so it needed to be completely re-worked. With this
patch my broken image no longer panics when I mount it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like "Btrfs: fix wrong max inline data
size limit" fixed.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
It used to be more complex but now it's just a simple array access.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
All callers use the same value, simplify the function.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
The structure is frequently reused. Rename it according to the slab
name.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
btrfs_interface_init rarely fails but we could leak the prelim_ref slab.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
The enum exists but is not consistently used.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
The last users are long gone.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
We know the tree block size, no need to pass it around.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
It's trivial with a single user. And remove one pointless BUG_ON.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Errors in readahead are not fatal and ignored elsewhere in the code.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
The parent_transid parameter has been unused since its introduction in
ca7a79ad8dbe2466 ("Pass down the expected generation number when reading
tree blocks"). In reada_tree_block, it was even wrongly set to leafsize.
Transid check is done in the proper read and readahead ignores errors.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
There are the branch hints that obviously depend on the data being
processed, the CPU predictor will do better job according to the actual
load. It also does not make sense to use the hints in slow paths that do
a lot of other operations like locking, waiting or IO.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Unlikely is implicit for NULL checks of pointers.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
Signed type mismatches the ioctl structure, all extent calculations are
done on unsigned types.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
|
|
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff. This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush. But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC. Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation. Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
during log replay. This is because we use fs_info->fs_root as our root for
shrinking and such. Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space. This happens because all the chunks were allocated for
data since the metadata requirements were so low. But now there's a bunch of
empty data block groups and not enough metadata space to do anything. This
patch solves this problem by automatically deleting empty block groups. If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.
When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it. This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"I've got a revert to fix a regression with btrfs device registration,
and Filipe has part two of his fsync fix from last week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: device_list_add() should not update list when mounted"
Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
|
|
When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).
The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO errors for example. When such error happens, it means we have parts
of the on disk extent that weren't written to, and so we end up logging
file extent items that point to these extents that contain garbage/random
data - so after a crash/reboot plus log replay, we get our inode's metadata
pointing to those extents.
This worked in contrast with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).
So if an error happens with any ordered operation, just return a -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redo the writes for e.g.) - and definitely not leave any file extent items
in the log refer to non fully written extents.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).
This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:
"Right before acquiring the inode's mutex, we might have new
writes dirtying pages, which won't immediately start the
respective ordered operations - that is done through the
fill_delalloc callbacks invoked from the writepage and
writepages address space operations. So make sure we start
all ordered operations before starting to log our inode. Not
doing this means that while logging the inode, writeback
could start and invoke writepage/writepages, which would call
the fill_delalloc callbacks (cow_file_range,
submit_compressed_extents). These callbacks add first an
extent map to the modified list of extents and then create
the respective ordered operation, which means in
tree-log.c:btrfs_log_inode() we might capture all existing
ordered operations (with btrfs_get_logged_extents()) before
the fill_delalloc callback adds its ordered operation, and by
the time we visit the modified list of extent maps (with
btrfs_log_changed_extents()), we see and process the extent
map they created. We then use the extent map to construct a
file extent item for logging without waiting for the
respective ordered operation to finish - this file extent
item points to a disk location that might not have yet been
written to, containing random data - so after a crash a log
replay will make our inode have file extent items that point
to disk locations containing invalid data, as we returned
success to userspace without waiting for the respective
ordered operation to finish, because it wasn't captured by
btrfs_get_logged_extents()."
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|