Age | Commit message (Collapse) | Author |
|
Currently we copy all the file information into the log, inode item, the
refs, xattrs etc. Except most of this doesn't change from fsync to fsync,
just the inode item changes. So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Currently udev does not know about the device being removed from the
file system. This may result in the situation where we're unable to
mount the file system by UUID or by LABEL because the by-uuid and
by-label links may still point to the device which is no longer part of
the btrfs file system and hence does not have any btrfs super block.
It can be easily reproduced by the following:
mkfs.btrfs -L bugfs /dev/loop[0-6]
mount /dev/loop0 /mnt/test
btrfs device delete /dev/loop0 /mnt/test
umount /mnt/test
mount LABEL=bugfs /mnt/test <---- this fails
then see:
ls -l /dev/disk/by-label/bugfs
which will still point to the /dev/loop0
We did not noticed this before because libblkid would send the udev
event for us when it notice that the link does not fit the reality,
however it does not do that anymore and completely relies on udev
information.
Fix this by sending the KOBJ_CHANGE event to the bdev kobject after
successful device removal.
Note that this does not affect device addition, because we will open the
device prior the addition from userspace and udev will notice that and
reread the device afterwards.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
ret variant may be set to 0 if we read page successfully, but it might be
released before we lock it again. On this case, if we fail to allocate a
new page, we will return 0, it is wrong, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we can pre-allocate the space past EOF, we should be able to reclaim
that space if we need. This patch implements it by removing the EOF check.
Though the manual of fallocate command says we can use truncate command to
reclaim the pre-allocated space which past EOF, but because truncate command
changes the file size, we must run several commands to reclaim the space if we
don't want to change the file size, so it is not a good choice.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Steps to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8
# fallocate -p -o 2048 -l 16384 <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat
# umount <mnt>
# dmesg
WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330
The reason is that we inputed a range which is beyond the end of the file. And
because the end of this range was not page-aligned, we had to truncate the last
page in this range, this operation is similar to a buffered file write. In other
words, we reserved enough space and clear the data which was in the hole range
on that page. But when we expanded that test file, write the data into the same
page, we forgot that we have reserved enough space for the buffered write of
that page because in most cases there is no page that is beyond the end of
the file. As a result, we reserved the space twice.
In fact, we needn't truncate the page if it is beyond the end of the file, just
release the allocated space in that range. Fix the above problem by this way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
(start + len) is the start of the adjacent extent, not the end of the current
extent, so we should not use it to check the hole is on the same page or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We forget to release the reserved space in the error path of delalloc
reservatiom, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If we runt the direct IO, we should not run auto defrag, because it may
introduce buffered IO vs direcIO problem, and make direct IO slow down.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
block_group->sectorsize
We should use ctl->unit for free space calculation instead of block_group->sectorsize
even though for free space use_bitmap or free space cluster we only have sectorsize assigned to ctl->unit currently. Also, we can keep it consisten in code style.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Refactor it by checking whether the inode has been created and needs to be
dropped (drop_inode_on_err) and also if the err variable is set. That way the
variable doesn't need to be set on each and every error handling block.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)
This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When the flag not supported is specified, it is necessary to return the error
to the caller.
So, we add the validity check of the fiemap's flag.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Passing a null extended attribute value means to remove the attribute,
but we don't have to add a new NULL extended attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If the acl can be exactly represented in the traditional file
mode permission bits, we don't set another acl attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
alloc_end is not the real end of the current extent, it is the start of the
next adjoining extent. So we needn't +1 when calculating the size the space
that is about to be reserved.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The kernel developers have implemented some often-used align macros, we should
use them instead of the complex code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This regression was introduced by the device-replace patches.
Scrub immediately stops checking those disks that have write errors.
This is nothing that happens in the real world, but it is wrong
since scrub is the tool to detect and repair defects. Fix it.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This issue was detected by the "0-DAY kernel build testing".
fs/btrfs/volumes.c: In function 'btrfs_rm_device':
fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label]
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The structure member mirror_num is modified concurrently to the
structure member is_iodone. This doesn't require any locking by
design, unless everything is stored in the same 32 bits of a
bit field. This was the case and xfstest 284 was able to
trigger false warnings from the checker code. This patch
seperates the bits and fixes the race.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If we freeze the fs, the auto defragment should not run. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We forget to get the defrag lock when we re-add the defragable inode,
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The auto defrag allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We need get write access for qgroup operations, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We need get write access for scrub, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Steps to reproduce:
# mkfs.btrfs -d single -m single <disk0> <disk1>
# mount -o ro <disk0> <mnt0>
# mount -o ro <disk0> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs device delete <disk1> <mnt1>
We can remove a device from a R/O filesystem. The reason is that we just check
the R/O flag of the super block object. It is not enough, because the kernel
may set the R/O flag only for the mount point. We need invoke
mnt_want_write_file()
to do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Steps to reproduce:
# mkfs.btrfs <partition>
# mount -o ro <partition> <mnt0>
# mount -o ro <partition> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs fi resize 10g <mnt1>
We re-sized a R/O filesystem. The reason is that we just check the R/O flag
of the super block object. It is not enough, because the kernel may set the
R/O flag only for the mount point. We need invoke mnt_want_write_file() to
do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When wen want to set the default subvolume, we must get write access, or
we will change the R/O file system.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If the id of the existed transaction is more than the one we specified, it
means the specified transaction was commited, so we should return 0, not
EINVAL.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If there is no running transaction in the fs, we needn't start a new one when
we want to start sync.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we have gotten the root in the caller, just pass it into
btrfs_ioctl_{start, wait}_sync() directly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If we found an invalid xattr dir item, we'd better try the next one instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
io_ctl_map_page is called by many functions in free-space-cache.
In most scenarios, the ->cur is not null, e.g. io_ctl_add_entry.
I think we'd better remove the warn_on here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This is the commit that allows to start the device replace
procedure.
An ioctl() interface is added that supports starting and
canceling the device replace procedure, and to retrieve
the status and progress.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"There are two major features for this merge window. The first is
inline data, which allows small files or directories to be stored in
the in-inode extended attribute area. (This requires that the file
system use inodes which are at least 256 bytes or larger; 128 byte
inodes do not have any room for in-inode xattrs.)
The second new feature is SEEK_HOLE/SEEK_DATA support. This is
enabled by the extent status tree patches, and this infrastructure
will be used to further optimize ext4 in the future.
Beyond that, we have the usual collection of code cleanups and bug
fixes."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (63 commits)
ext4: zero out inline data using memset() instead of empty_zero_page
ext4: ensure Inode flags consistency are checked at build time
ext4: Remove CONFIG_EXT4_FS_XATTR
ext4: remove unused variable from ext4_ext_in_cache()
ext4: remove redundant initialization in ext4_fill_super()
ext4: remove redundant code in ext4_alloc_inode()
ext4: use sync_inode_metadata() when syncing inode metadata
ext4: enable ext4 inline support
ext4: let fallocate handle inline data correctly
ext4: let ext4_truncate handle inline data correctly
ext4: evict inline data out if we need to strore xattr in inode
ext4: let fiemap work with inline data
ext4: let ext4_rename handle inline dir
ext4: let empty_dir handle inline dir
ext4: let ext4_delete_entry() handle inline data
ext4: make ext4_delete_entry generic
ext4: let ext4_find_entry handle inline data
ext4: create a new function search_dir
ext4: let ext4_readdir handle inline data
ext4: let add_dir_entry handle inline data properly
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"A quiet cycle for the security subsystem with just a few maintenance
updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
Smack: create a sysfs mount point for smackfs
Smack: use select not depends in Kconfig
Yama: remove locking from delete path
Yama: add RCU to drop read locking
drivers/char/tpm: remove tasklet and cleanup
KEYS: Use keyring_alloc() to create special keyrings
KEYS: Reduce initial permissions on keys
KEYS: Make the session and process keyrings per-thread
seccomp: Make syscall skipping and nr changes more consistent
key: Fix resource leak
keys: Fix unreachable code
KEYS: Add payload preparsing opportunity prior to key instantiate or update
|
|
The writeback code is already capable of passing errors back to user space
by means of the open_context->error. In the case of ENOSPC, Neil Brown
is reporting seeing 2 errors being returned.
Neil writes:
"e.g. if /mnt2/ if an nfs mounted filesystem that has no space then
strace dd if=/dev/zero conv=fsync >> /mnt2/afile count=1
reported Input/output error and the relevant parts of the strace output are:
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
fsync(1) = -1 EIO (Input/output error)
close(1) = -1 ENOSPC (No space left on device)"
Neil then shows that the duplication of error messages appears to be due to
the use of the PageError() mechanism, which causes filemap_fdatawait_range
to return the extra EIO. The regression was introduced by
commit 7b281ee026552f10862b617a2a51acf49c829554 (NFS: fsync() must exit
with an error if page writeback failed).
Fix this by removing the call to SetPageError(), and just relying on
open_context->error reporting the ENOSPC back to fsync().
Reported-by: Neil Brown <neilb@suse.de>
Tested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [3.6+]
|
|
Pull pstore update from Anton Vorontsov:
"Here are just a few fixups for the pstore subsystem, nothing special
this time"
* tag 'for-v3.8' of git://git.infradead.org/users/cbou/linux-pstore:
pstore/ftrace: Adjust for ftrace_ops->func prototype change
pstore/ram: Fix bounds checks for mem_size, record_size, console_size and ftrace_size
pstore/ram: Fix undefined usage of rounddown_pow_of_two(0)
pstore/ram: Fixup section annotations
|
|
If an RPC call is interrupted, assume that the server hasn't processed
the RPC call so that the next time we use the slot, we know that if we
get a NFS4ERR_SEQ_MISORDERED or NFS4ERR_SEQ_FALSE_RETRY, we just have
to bump the sequence number.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Pull GFS2 updates from Steven Whitehouse:
"The main feature this time is the new Orlov allocator and the patches
leading up to it which allow us to allocate new inodes from their own
allocation context, rather than borrowing that of their parent
directory. It is this change which then allows us to choose a
different location for subdirectories when required. This works
exactly as per the ext3 implementation from the users point of view.
In addition to that, we've got a speed up in gfs2_rbm_from_block()
from Bob Peterson, three locking related improvements from Dave
Teigland plus a selection of smaller bug fixes and clean ups."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Set gl_object during inode create
GFS2: add error check while allocating new inodes
GFS2: don't reference inode's glock during block allocation trace
GFS2: remove redundant lvb pointer
GFS2: only use lvb on glocks that need it
GFS2: skip dlm_unlock calls in unmount
GFS2: Fix one RG corner case
GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode
GFS2: Use dirty_inode in gfs2_dir_add
GFS2: Fix truncation of journaled data files
GFS2: Add Orlov allocator
GFS2: Use proper allocation context for new inodes
GFS2: Add test for resource group congestion status
GFS2: Rename glops go_xmote_th to go_sync
GFS2: Speed up gfs2_rbm_from_block
GFS2: Review bug traps in glops.c
|
|
Shave a few bytes off the slot table size by moving the RPC timestamp
into the sequence results.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
If the server returns NFS4ERR_SEQ_MISORDERED, it could be a sign
that the slot was retired at some point. Retry the attempt after
reinitialising the slot sequence number to 1.
Also add a handler for NFS4ERR_SEQ_FALSE_RETRY. Just bump the slot
sequence number and retry...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.
However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.
Which made the following nasty sequence possible.
pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}
Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
If the inode has no links, then we should force a new lookup.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
It is almost always wrong for NFS to call drop_nlink() after removing a
file. What we really want is to mark the inode's attributes for
revalidation, and we want to ensure that the VFS drops it if we're
reasonably sure that this is the final unlink().
Do the former using the usual cache validity flags, and the latter
by testing if inode->i_nlink == 1, and clearing it in that case.
This also fixes the following warning reported by Neil Brown and
Jeff Layton (among others).
[634155.004438] WARNING:
at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.5.0/lin [634155.004442]
Hardware name: Latitude E6510 [634155.004577] crc_itu_t crc32c_intel
snd_hwdep snd_pcm snd_timer snd soundcor [634155.004609] Pid: 13402, comm:
bash Tainted: G W 3.5.0-36-desktop # [634155.004611] Call Trace:
[634155.004630] [<ffffffff8100444a>] dump_trace+0xaa/0x2b0
[634155.004641] [<ffffffff815a23dc>] dump_stack+0x69/0x6f
[634155.004653] [<ffffffff81041a0b>] warn_slowpath_common+0x7b/0xc0
[634155.004662] [<ffffffff811832e4>] drop_nlink+0x34/0x40
[634155.004687] [<ffffffffa05bb6c3>] nfs_dentry_iput+0x33/0x70 [nfs]
[634155.004714] [<ffffffff8118049e>] dput+0x12e/0x230
[634155.004726] [<ffffffff8116b230>] __fput+0x170/0x230
[634155.004735] [<ffffffff81167c0f>] filp_close+0x5f/0x90
[634155.004743] [<ffffffff81167cd7>] sys_close+0x97/0x100
[634155.004754] [<ffffffff815c3b39>] system_call_fastpath+0x16/0x1b
[634155.004767] [<00007f2a73a0d110>] 0x7f2a73a0d10f
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [3.3+]
|
|
There is no need to cache stale inodes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Same bug as fixed by Idan for write_exec was in read_exec.
Fix the io_state leak and pages state on read error.
Also while at it:
The if (!pcol->read_4_write) at the error path is redundant
because all goto err; are after the if (pcol->read_4_write)
bale out.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
Merge emailed autofs cleanup/fix patches from Ian Kent
* autofs:
autofs4 - use simple_empty() for empty directory check
autofs4 - dont clear DCACHE_NEED_AUTOMOUNT on rootless mount
|
|
For direct (and offset) mounts, if an automounted mount is manually
umounted the trigger mount dentry can appear non-empty causing it to
not trigger mounts. This can also happen if there is a file handle
leak in a user space automounting application.
This happens because, when a ioctl control file handle is opened
on the mount, a cursor dentry is created which causes list_empty()
to see the dentry as non-empty. Since there is a case where listing
the directory of these dentrys is needed, the use of dcache_dir_*()
functions for .open() and .release() is needed.
Consequently simple_empty() must be used instead of list_empty()
when checking for an empty directory.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|