Age | Commit message (Collapse) | Author |
|
If ext4_jbd2_file_inode() in ext4_ordered_write_end() fails for some
reasons, this function returns to caller without unlocking the page.
It leads to the deadlock, and the patch fixes this issue.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fix a bug introduced by 20b45077. We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively. This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
If lookup_extent_backref fails, path->nodes[0] reasonably could be
null along with other callers of btrfs_print_leaf, so ensure we have a
valid extent buffer before dereferencing.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
|
|
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
In btrfs_get_acl(), when the second __btrfs_getxattr() call fails,
acl is not correctly set.
Therefore, a wrong value might return to the caller.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
|
|
Free space items are located in tree of tree roots, not in the extent
tree. It didn't pop up because lookup_free_space_inode() grabs the
inode all the time instead of actually searching the tree.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
|
|
To reproduce the bug:
# mount -o nodatacow /dev/sda7 /mnt/
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
dd: writing `/mnt/tmp': Input/output error
1+0 records in
0+0 records out
btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
Otherwise we can execced the array bound of path->slots[].
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.
Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
|
|
We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.
There are three bugs:
- When should_defrag_range() decides we should keep on defragmenting
an extent, last_len is not incremented. (old bug)
- The length that passes to should_defrag_range() is not the length
we're going to defrag. (new bug)
- We always defrag 256K bytes data, and a big extent can be part of
this range. (new bug)
For a file with 4 extents:
| 4K | 4K | 256K | 256K |
The result of defrag with (the default) 256K extent thresh should be:
| 264K | 256K |
but with those bugs, we'll get:
| 520K |
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
|
|
It's off-by-one, and thus we may skip the last page while defragmenting.
An example case:
# create /mnt/file with 2 4K file extents
# btrfs fi defrag /mnt/file
# sync
# filefrag /mnt/file
/mnt/file: 2 extents found
So it's not defragmented.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
Don't use inode->i_size directly, since we're not holding i_mutex.
This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
There's an off-by-one bug:
# create a file with lots of 4K file extents
# btrfs fi defrag /mnt/file
# sync
# filefrag -v /mnt/file
Filesystem type is: 9123683e
File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3372 64
1 64 3136 3435 1
2 65 3436 3136 64
3 129 3201 3499 1
4 130 3500 3201 64
5 194 3266 3563 1
6 195 3564 3266 64
7 259 3331 3627 1
8 260 3628 3331 40 eof
After this patch:
...
# filefrag -v /mnt/file
Filesystem type is: 9123683e
File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3372 300 eof
/mnt/file: 1 extent found
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
|
|
kmemleak found this:
unreferenced object 0xffff8801b64af968 (size 512):
comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
hex dump (first 32 bytes):
00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff ................
80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff ................
backtrace:
[<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0
[<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240
[<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20
[<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210
[<ffffffff812479d7>] cleaner_kthread+0x177/0x190
[<ffffffff81075c7d>] kthread+0x8d/0xa0
[<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
"pages" is not always freed. Fix it removing the unnecesary additional return.
Signed-off-by: Diego Calleja <diegocg@gmail.com>
|
|
Offset field in data extent backref can underflow if clone range ioctl
is used. We can reliably detect the underflow because max file size is
limited to 2^63 and max data extent size is limited by block group size.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
|
|
|
|
Add support to print nostrictsync and noperm mount options in
/proc/mounts for shares mounted with these options.
(cleanup merge conflict in Sachin's original patch)
Suggested-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
sysfs is a core piece of ifrastructure that many people use and
few people have all of the rules in their head on how to use
it correctly. Add warnings for people using tagged directories
improperly to that any misuses can be caught and diagnosed quickly.
A single inexpensive test in sysfs_find_dirent is almost sufficient
to catch all possible misuses. An additional warning is needed
in sysfs_add_dirent so that we actually fail when attempting to
add an untagged dirent in a tagged directory.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that /sys/class/net/bonding_masters is implemented as a tagged sysfs
file we can remove support for untagged files in tagged directories.
This change removes any ambiguity of what a NULL namespace value
means. A NULL namespace parameter after this patch means
that we are talking about an untagged sysfs dirent.
This makes the sysfs code much less prone to mistakes when during
maintenance.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Looking up files in sysfs is hard to understand and analyize because we
currently allow placing untagged files in tagged directories. In the
implementation of that we have two subtly different meanings of NULL.
NULL meaning there is no tag on a directory entry and NULL meaning
we don't care which namespace the lookup is performed for. This
multiple uses of NULL have resulted in subtle bugs (since fixed)
in the code.
Currently it is only the bonding driver that needs to have an untagged
file in a tagged directory.
To untagle this mess I am adding support for tagged files to sysfs.
Modifying the bonding driver to implement bonding_masters as a tagged
file. Registering bonding_masters once for each network namespace.
Then I am removing support for untagged entries in tagged sysfs
directories.
Resulting in code that is much easier to reason about.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We don't need a mempool in order to guarantee reliable NFS read performance.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Don't rely on the PageError flag to tell us if one of the partial reads of
the page failed. Instead, replace that with a dedicated flag in the
struct nfs_page.
Then clean out redundant uses of the PageError flag: the VM no longer
checks it for reads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
The generic file read code does that for us anyway.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
It can trivially be replaced with rpc_restart_call_prepare.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
CIFS currently uses wait_event_killable to put tasks to sleep while
they await replies from the server. That function though does not
allow the freezer to run. In many cases, the network interface may
be going down anyway, in which case the reply will never come. The
client then ends up blocking the computer from suspending.
Fix this by adding a new wait_event_freezable variant --
wait_event_freezekillable. The idea is to combine the behavior of
wait_event_killable and wait_event_freezable -- put the task to
sleep and only allow it to be awoken by fatal signals, but also
allow the freezer to do its job.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Tune bdi.ra_pages to be a multiple of the rsize. This prevents the VFS
from asking for pages that require small reads to satisfy.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Currently we cap the rsize at a value that fits in CIFSMaxBufSize. That's
not needed any longer for readpages. Allow the use of larger values for
readpages. cifs_iovec_read and cifs_read however are still limited to the
CIFSMaxBufSize. Make sure they don't exceed that.
The patch also changes the rsize defaults. The default when unix
extensions are enabled is set to 1M for parity with the wsize, and there
is a hard cap of ~16M.
When unix extensions are not enabled, the default is set to 60k. According
to MS-CIFS, Windows servers can only send a max of 60k at a time, so
this is more efficient than requesting a larger size. If the user wishes
however, the max can be extended up to 128k - the length of the READ_RSP
header.
Really old servers however require a special hack to ensure that we don't
request too large a read.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Now that we have code in place to do asynchronous reads, convert
cifs_readpages to use it. The new cifs_readpages walks the page_list
that gets passed in, locks and adds the pages to the pagecache and
sets up cifs_readdata to handle the reads.
The rest is handled by the cifs_async_readv infrastructure.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
...which will allow cifs to do an asynchronous read call to the server.
The caller will allocate and set up cifs_readdata for each READ_AND_X
call that should be issued on the wire. The pages passed in are added
to the pagecache, but not placed on the LRU list yet (as we need the
page->lru to keep the pages on the list in the readdata).
When cifsd identifies the mid, it will see that there is a special
receive handler for the call, and use that to receive the rest of the
frame. cifs_readv_receive will then marshal up a kvec array with
kmapped pages from the pagecache, which eliminates one copy of the
data. Once the data is received, the pages are added to the LRU list,
set uptodate, and unlocked.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
There is no pad, and it simplifies the code to remove the "Data" field.
None of the existing code relies on these fields, or on the READ_RSP
being a particular length.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
In order to handle larger SMBs for readpages and other calls, we want
to be able to read into a preallocated set of buffers. Rather than
changing all of the existing code to preallocate buffers however, we
instead add a receive callback function to the MID.
cifsd will call this function once the mid_q_entry has been identified
in order to receive the rest of the SMB. If the mid can't be identified
or the receive pointer is unset, then the standard 3rd phase receive
function will be called.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Move the entire 3rd phase of the receive codepath into a separate
function in preparation for the addition of a pluggable receive
function.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
In order to receive directly into a preallocated buffer, we need to ID
the mid earlier, before the bulk of the response is read. Call the mid
finding routine as soon as we're able to read the mid.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
We have several functions that need to access these pointers. Currently
that's done with a lot of double pointer passing. Instead, move them
into the TCP_Server_Info and simplify the handling.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Change find_cifs_mid to only return NULL if a mid could not be found.
If we got part of a multi-part T2 response, then coalesce it and still
return the mid. The caller can determine the T2 receive status from
the flags in the mid.
With this change, there is no need to pass a pointer to "length" as
well so just pass by value. If a mid is found, then we can just mark
it as malformed. If one isn't found, then the value of "length" won't
change anyway.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Begin breaking up find_cifs_mid into smaller pieces. The parts that
coalesce T2 responses don't really need to be done under the
GlobalMid_lock anyway. Create a new function that just finds the
mid on the list, and then later takes it off the list if the entire
response has been received.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Have the demultiplex thread receive just enough to get to the MID, and
then find it before receiving the rest. Later, we'll use this to swap
in a preallocated receive buffer for some calls.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Having to continually allocate a new kvec array is expensive. Allocate
one that's big enough, and only reallocate it as needed.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
Eventually we'll want to allow cifsd to read data directly into the
pagecache. In order to do that we'll need a routine that can take a
kvec array and pass that directly to kernel_recvmsg.
Unfortunately though, the kernel's recvmsg routines modify the kvec
array that gets passed in, so we need to use a copy of the kvec array
and refresh that copy on each pass through the loop.
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
|
|
I noticed we had a little bit of latency when writing out the space cache
inodes. It's because we flush it before we write anything in case we have dirty
pages already there. This doesn't matter though since we're just going to
overwrite the space, and there really shouldn't be any dirty pages anyway. This
makes some of my tests run a little bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
Mitch kept hitting a panic because he was getting ENOSPC. One of my previous
patches makes it so we are much better at not allocating new metadata chunks.
Unfortunately coupled with the overcommit patch this works us into a bit of a
problem if we are removing a bunch of space and end up chewing up all of our
space with pinned extents. We can allocate chunks fine and overflow is ok, but
the only way to reclaim this space is to commit the transaction. So if we go to
overcommit, first check and see how much pinned space we have. If we have more
than 80% of the free space chewed up with pinned extents, just commit the
transaction, this will free up enough space for our reservation and we won't
have this problem anymore. With this patch Mitch's test doesn't blow up
anymore. Thanks,
Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it. However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction. So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space. If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item. Do this and migrate the space the global reserve so it all
works out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
We started setting trans->block_rsv = NULL to allow the delayed refs flushing
stuff to use the right block_rsv and then just made
btrfs_trans_release_metadata() unconditionally use the trans block rsv. The
problem with this is we need to reserve some space in the transaction and then
migrate it to the global block rsv, so we need to be able to free that out
properly. So instead just move btrfs_trans_release_metadata() before the
delayed ref flushing and use trans->block_rsv for the freeing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|
|
Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
time. This was ok before, but now we have overcommit which will screw us in a
heartbeat if we are quickly filling the disk. So instead pick either 2
megabytes or the number of pages we need to reclaim to be safe again, which ever
is larger. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
|