summaryrefslogtreecommitdiffstats
path: root/fs/xfs
AgeCommit message (Collapse)Author
2014-06-10fs,userns: Change inode_capable to capable_wrt_inode_uidgidAndy Lutomirski
The kernel has no concept of capabilities with respect to inodes; inodes exist independently of namespaces. For example, inode_capable(inode, CAP_LINUX_IMMUTABLE) would be nonsense. This patch changes inode_capable to check for uid and gid mappings and renames it to capable_wrt_inode_uidgid, which should make it more obvious what it does. Fixes CVE-2014-4014. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Chinner <david@fromorbit.com> Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-15xfs: list_lru_init returns a negative errorDave Chinner
And we don't invert it properly when initialising the dquot lru list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: negate xfs_icsb_init_counters error value Dave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: negate mount workqueue init error valueDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: fix wrong err sign on xfs_set_acl()Dave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: fix wrong errno from xfs_initxattrsDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: correct error sign on COLLAPSE_RANGE errorsDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: xfs_commit_metadata returns wrong errnoDave Chinner
Invert it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: fix incorrect error sign in xfs_file_aio_readDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15xfs: xfs_dir_fsync() returns positive errnoDave Chinner
And it should be negative. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-08Merge tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs fixes from Dave Chinner: "The main fix is adding support for default ACLs on O_TMPFILE opened inodes to bring XFS into line with other filesystems. Metadata CRCs are now also considered well enough tested to be fully supported, so we're removing the shouty warnings issued at mount time for filesystems with that format. And there's transaction block reservation overrun fix. Summary: - fix a remote attribute size calculation bug that leads to a transaction overrun - add default ACLs to O_TMPFILE files - Remove the EXPERIMENTAL tag from filesystems with metadata CRC support" * tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs: xfs: remote attribute overwrite causes transaction overrun xfs: initialize default acls for ->tmpfile() xfs: fully support v5 format filesystems
2014-05-06xfs: remote attribute overwrite causes transaction overrunDave Chinner
Commit e461fcb ("xfs: remote attribute lookups require the value length") passes the remote attribute length in the xfs_da_args structure on lookup so that CRC calculations and validity checking can be performed correctly by related code. This, unfortunately has the side effect of changing the args->valuelen parameter in cases where it shouldn't. That is, when we replace a remote attribute, the incoming replacement stores the value and length in args->value and args->valuelen, but then the lookup which finds the existing remote attribute overwrites args->valuelen with the length of the remote attribute being replaced. Hence when we go to create the new attribute, we create it of the size of the existing remote attribute, not the size it is supposed to be. When the new attribute is much smaller than the old attribute, this results in a transaction overrun and an ASSERT() failure on a debug kernel: XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 331 Fix this by keeping the remote attribute value length separate to the attribute value length in the xfs_da_args structure. The enables us to pass the length of the remote attribute to be removed without overwriting the new attribute's length. Also, ensure that when we save remote block contexts for a later rename we zero the original state variables so that we don't confuse the state of the attribute to be removes with the state of the new attribute that we just added. [Spotted by Brain Foster.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06xfs: initialize default acls for ->tmpfile()Brian Foster
The current tmpfile handler does not initialize default ACLs. Doing so within xfs_vn_tmpfile() makes it roughly equivalent to xfs_vn_mknod(), which is already used as a common create handler. xfs_vn_mknod() does not currently have a mechanism to determine whether to link the file into the namespace. Therefore, further abstract xfs_vn_mknod() into a new xfs_generic_create() handler with a tmpfile parameter. This new handler calls xfs_create_tmpfile() and d_tmpfile() on the dentry when called via ->tmpfile(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05xfs: fully support v5 format filesystemsDave Chinner
We have had this code in the kernel for over a year now and have shaken all the known issues out of the code over the past few releases. It's now time to remove the experimental warnings during mount and fully support the new filesystem format in production systems. Remove the experimental warning, and add a version number to the initial "mounting filesystem" message to tell use what type of filesystem is being mounted. Also, remove the temporary inode cluster size output at mount time now we know that this code works fine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-20Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "These are regression and bug fixes for ext4. We had a number of new features in ext4 during this merge window (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so there were many more regression and bug fixes this time around. It didn't help that xfstests hadn't been fully updated to fully stress test COLLAPSE_RANGE until after -rc1" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits) ext4: disable COLLAPSE_RANGE for bigalloc ext4: fix COLLAPSE_RANGE failure with 1KB block size ext4: use EINVAL if not a regular file in ext4_collapse_range() ext4: enforce we are operating on a regular file in ext4_zero_range() ext4: fix extent merging in ext4_ext_shift_path_extents() ext4: discard preallocations after removing space ext4: no need to truncate pagecache twice in collapse range ext4: fix removing status extents in ext4_collapse_range() ext4: use filemap_write_and_wait_range() correctly in collapse range ext4: use truncate_pagecache() in collapse range ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled ext4: always check ext4_ext_find_extent result ext4: fix error handling in ext4_ext_shift_extents ext4: silence sparse check warning for function ext4_trim_extent ext4: COLLAPSE_RANGE only works on extent-based files ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches ext4: use i_size_read in ext4_unaligned_aio() fs: disallow all fallocate operation on active swapfile fs: move falloc collapse range check into the filesystem methods ...
2014-04-17xfs: fix tmpfile/selinux deadlock and initialize securityBrian Foster
xfstests generic/004 reproduces an ilock deadlock using the tmpfile interface when selinux is enabled. This occurs because xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The latter eventually calls into xfs_xattr_get() which attempts to get the lock again. E.g.: xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080 ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8 00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540 ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480 Call Trace: [<ffffffff8177f969>] schedule+0x29/0x70 [<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120 [<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] [<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffff810ed179>] ? down_read_nested+0x89/0xa0 [<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs] [<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs] [<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs] [<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs] [<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs] [<ffffffff8124842f>] generic_getxattr+0x4f/0x70 [<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650 [<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20 [<ffffffff813351bb>] security_d_instantiate+0x1b/0x30 [<ffffffff81237db0>] d_instantiate+0x50/0x70 [<ffffffff81237e85>] d_tmpfile+0xb5/0xc0 [<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs] [<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs] [<ffffffff81230388>] path_openat+0x228/0x6a0 [<ffffffff810230f9>] ? sched_clock+0x9/0x10 [<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40 [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0 [<ffffffff8123101a>] do_filp_open+0x3a/0x90 [<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40 [<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0 [<ffffffff8121e3ce>] do_sys_open+0x12e/0x210 [<ffffffff8121e4ce>] SyS_open+0x1e/0x20 [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b xfs_vn_tmpfile() also fails to initialize security on the newly created inode. Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction has been committed and the inode unlocked. Also, initialize security on the inode based on the parent directory provided via the tmpfile call. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: fix buffer use after free on IO errorEric Sandeen
When testing exhaustion of dm snapshots, the following appeared with CONFIG_DEBUG_OBJECTS_FREE enabled: ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs] indicating that we'd freed a buffer which still had a pending reference, down this path: [ 190.867975] [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270 [ 190.880820] [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370 [ 190.892615] [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs] [ 190.905629] [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs] [ 190.911770] [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs] At issue is the fact that if IO fails in xfs_buf_iorequest, we'll queue completion unconditionally, and then call xfs_buf_rele; but if IO failed, there are no IOs remaining, and xfs_buf_rele will free the bp while work is still queued. Fix this by not scheduling completion if the buffer has an error on it; run it immediately. The rest is only comment changes. Thanks to dchinner for spotting the root cause. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: wrong error sign conversion during failed DIO writesDave Chinner
We negate the error value being returned from a generic function incorrectly. The code path that it is running in returned negative errors, so there is no need to negate it to get the correct error signs here. This was uncovered by generic/019. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: unmount does not wait for shutdown during unmountDave Chinner
And interesting situation can occur if a log IO error occurs during the unmount of a filesystem. The cases reported have the same signature - the update of the superblock counters fails due to a log write IO error: XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1 XFS (dm-16): Log I/O Error Detected. Shutting down filesystem XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount. XFS (dm-16): xfs_log_force: error 5 returned. XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s) It can be seen that the last line of output contains a corrupt device name - this is because the log and xfs_mount structures have already been freed by the time this message is printed. A kernel oops closely follows. The issue is that the shutdown is occurring in a separate IO completion thread to the unmount. Once the shutdown processing has started and all the iclogs are marked with XLOG_STATE_IOERROR, the log shutdown code wakes anyone waiting on a log force so they can process the shutdown error. This wakes up the unmount code that is doing a synchronous transaction to update the superblock counters. The unmount path now sees all the iclogs are marked with XLOG_STATE_IOERROR and so never waits on them again, knowing that if it does, there will not be a wakeup trigger for it and we will hang the unmount if we do. Hence the unmount runs through all the remaining code and frees all the filesystem structures while the xlog_iodone() is still processing the shutdown. When the log shutdown processing completes, xfs_do_force_shutdown() emits the "Please umount the filesystem and rectify the problem(s)" message, and xlog_iodone() then aborts all the objects attached to the iclog. An iclog that has already been freed.... The real issue here is that there is no serialisation point between the log IO and the unmount. We have serialisations points for log writes, log forces, reservations, etc, but we don't actually have any code that wakes for log IO to fully complete. We do that for all other types of object, so why not iclogbufs? Well, it turns out that we can easily do this. We've got xfs_buf handles, and that's what everyone else uses for IO serialisation. i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only release the lock in xlog_iodone() when we are finished with the buffer. That way before we tear down the iclog, we can lock and unlock the buffer to ensure IO completion has finished completely before we tear it down. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Bob Mastors <bob.mastors@solidfire.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: collapse range is delalloc challengedDave Chinner
FSX has been detecting data corruption after to collapse range calls. The key observation is that the offset of the last extent in the file was not being shifted, and hence when the file size was adjusted it was truncating away data because the extents handled been correctly shifted. Tracing indicated that before the collapse, the extent list looked like: .... ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 and after the shift of 2 blocks: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0 Note that the last extent did not change offset. After the changing of the file size: ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0 ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0 ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0 You can see that the last extent had it's length truncated, indicating that we've lost data. The reason for this is that the xfs_bmap_shift_extents() loop uses XFS_IFORK_NEXTENTS() to determine how many extents are in the inode. This, unfortunately, doesn't take into account delayed allocation extents - it's a count of physically allocated extents - and hence when the file being collapsed has a delalloc extent like this one does prior to the range being collapsed: .... ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0 .... it gets the count wrong and terminates the shift loop early. Fix it by using the in-memory extent array size that includes delayed allocation extents to determine the number of extents on the inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17xfs: don't map ranges that span EOF for direct IODave Chinner
Al Viro tracked down the problem that has caused generic/263 to fail on XFS since the test was introduced. If is caused by xfs_get_blocks() mapping a single extent that spans EOF without marking it as buffer-new() so that the direct IO code does not zero the tail of the block at the new EOF. This is a long standing bug that has been around for many, many years. Because xfs_get_blocks() starts the map before EOF, it can't set buffer_new(), because that causes he direct IO code to also zero unaligned sectors at the head of the IO. This would overwrite valid data with zeros, and hence we cannot validly return a single extent that spans EOF to direct IO. Fix this by detecting a mapping that spans EOF and truncate it down to EOF. This results in the the direct IO code doing the right thing for unaligned data blocks before EOF, and then returning to get another mapping for the region beyond EOF which XFS treats correctly by setting buffer_new() on it. This makes direct Io behave correctly w.r.t. tail block zeroing beyond EOF, and fsx is happy about that. Again, thanks to Al Viro for finding what I couldn't. [ dchinner: Fix for __divdi3 build error: Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: zeroing space needs to punch delalloc blocksDave Chinner
When we are zeroing space andit is covered by a delalloc range, we need to punch the delalloc range out before we truncate the page cache. Failing to do so leaves and inconsistency between the page cache and the extent tree, which we later trip over when doing direct IO over the same range. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: xfs_vm_write_end truncates too much on failureDave Chinner
Similar to the write_begin problem, xfs-vm_write_end will truncate back to the old EOF, potentially removing page cache from over the top of delalloc blocks with valid data in them. Fix this by truncating back to just the start of the failed write. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: write failure beyond EOF truncates too much dataDave Chinner
If we fail a write beyond EOF and have to handle it in xfs_vm_write_begin(), we truncate the inode back to the current inode size. This doesn't take into account the fact that we may have already made successful writes to the same page (in the case of block size < page size) and hence we can truncate the page cache away from blocks with valid data in them. If these blocks are delayed allocation blocks, we now have a mismatch between the page cache and the extent tree, and this will trigger - at minimum - a delayed block count mismatch assert when the inode is evicted from the cache. We can also trip over it when block mapping for direct IO - this is the most common symptom seen from fsx and fsstress when run from xfstests. Fix it by only truncating away the exact range we are updating state for in this write_begin call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14xfs: kill buffers over failed write ranges properlyDave Chinner
When a write fails, if we don't clear the delalloc flags from the buffers over the failed range, they can persist beyond EOF and cause problems. writeback will see the pages in the page cache, see they are dirty and continually retry the write, assuming that the page beyond EOF is just racing with a truncate. The page will eventually be released due to some other operation (e.g. direct IO), and it will not pass through invalidation because it is dirty. Hence it will be released with buffer_delay set on it, and trigger warnings in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct because invalidation failed and we didn't write the corect amount. This causes failures on block size < page size filesystems in fsx and fsstress workloads run by xfstests. Fix it by completely trashing any state on the buffer that could be used to imply that it contains valid data when the delalloc range over the buffer is punched out during the failed write handling. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-12fs: move falloc collapse range check into the filesystem methodsLukas Czerner
Currently in do_fallocate in collapse range case we're checking whether offset + len is not bigger than i_size. However there is nothing which would prevent i_size from changing so the check is pointless. It should be done in the file system itself and the file system needs to make sure that i_size is not going to change. The i_size check for the other fallocate modes are also done in the filesystems. As it is now we can easily crash the kernel by having two processes doing truncate and fallocate collapse range at the same time. This can be reproduced on ext4 and it is theoretically possible on xfs even though I was not able to trigger it with this simple test. This commit removes the check from do_fallocate and adds it to the file system. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-04-07mm: implement ->map_pages for page cacheKirill A. Shutemov
filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs update from Dave Chinner: "There are a couple of new fallocate features in this request - it was decided that it was easiest to push them through the XFS tree using topic branches and have the ext4 support be based on those branches. Hence you may see some overlap with the ext4 tree merge depending on how they including those topic branches into their tree. Other than that, there is O_TMPFILE support, some cleanups and bug fixes. The main changes in the XFS tree for 3.15-rc1 are: - O_TMPFILE support - allowing AIO+DIO writes beyond EOF - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS implementation - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS implementation - IO verifier cleanup and rework - stack usage reduction changes - vm_map_ram NOIO context fixes to remove lockdep warings - various bug fixes and cleanups" * tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits) xfs: fix directory hash ordering bug xfs: extra semi-colon breaks a condition xfs: Add support for FALLOC_FL_ZERO_RANGE fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate xfs: inode log reservations are still too small xfs: xfs_check_page_type buffer checks need help xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation xfs: use NOIO contexts for vm_map_ram xfs: don't leak EFSBADCRC to userspace xfs: fix directory inode iolock lockdep false positive xfs: allocate xfs_da_args to reduce stack footprint xfs: always do log forces via the workqueue xfs: modify verifiers to differentiate CRC from other errors xfs: print useful caller information in xfs_error_report xfs: add xfs_verifier_error() xfs: add helper for updating checksums on xfs_bufs xfs: add helper for verifying checksums on xfs_bufs xfs: Use defines for CRC offsets in all cases xfs: skip pointless CRC updates after verifier failures xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate ...
2014-04-04Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04Merge branch 'xfs-bug-fixes-for-3.15-3' into for-nextDave Chinner
2014-04-04xfs: fix directory hash ordering bugMark Tinguely
Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced in 3.10 incorrectly converted the btree hash index array pointer in xfs_da3_fixhashpath(). It resulted in the the current hash always being compared against the first entry in the btree rather than the current block index into the btree block's hash entry array. As a result, it was comparing the wrong hashes, and so could misorder the entries in the btree. For most cases, this doesn't cause any problems as it requires hash collisions to expose the ordering problem. However, when there are hash collisions within a directory there is a very good probability that the entries will be ordered incorrectly and that actually matters when duplicate hashes are placed into or removed from the btree block hash entry array. This bug results in an on-disk directory corruption and that results in directory verifier functions throwing corruption warnings into the logs. While no data or directory entries are lost, access to them may be compromised, and attempts to remove entries from a directory that has suffered from this corruption may result in a filesystem shutdown. xfs_repair will fix the directory hash ordering without data loss occuring. [dchinner: wrote useful a commit message] cc: <stable@vger.kernel.org> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04xfs: extra semi-colon breaks a conditionDan Carpenter
There were some extra semi-colons here which mean that we return true unintentionally. Fixes: a49935f200e2 ('xfs: xfs_check_page_type buffer checks need help') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-01xfs_file_buffered_aio_write(): switch to generic_perform_write()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01generic_file_direct_write(): get rid of ppos argumentAl Viro
always equal to &iocb->ki_pos. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01kill the 5th argument of generic_file_buffered_write()Al Viro
same story - it's &iocb->ki_pos in all cases Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01new helper: readlink_copy()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-13fs: push sync_filesystem() down to the file system's remount_fs()Theodore Ts'o
Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
2014-03-13Merge branch 'xfs-O_TMPFILE-support' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_trans_resv.c - fix for XFS_INODE_CLUSTER_SIZE macro removal
2014-03-13Merge branch 'xfs-bug-fixes-for-3.15-2' into for-nextDave Chinner
2014-03-13Merge branch 'xfs-verifier-cleanup' into for-nextDave Chinner
2014-03-13Merge branch 'xfs-stack-fixes' into for-nextDave Chinner
2014-03-13Merge branch 'xfs-collapse-range' into for-nextDave Chinner
2014-03-13xfs: Add support for FALLOC_FL_ZERO_RANGELukas Czerner
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same functionality as xfs ioctl XFS_IOC_ZERO_RANGE. We can also preallocate blocks past EOF in the same was as with fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain the same even if we preallocate blocks past EOF. It uses the same code to zero range as it is used by the XFS_IOC_ZERO_RANGE ioctl. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: inode log reservations are still too smallDave Chinner
Back in commit 23956703 ("xfs: inode log reservations are too small"), the reservation size was increased to take into account the difference in size between the in-memory BMBT block headers and the on-disk BMDR headers. This solved a transaction overrun when logging the inode size. Recently, however, we've seen a number of these same overruns on kernels with the above fix in it. All of them have been by 4 bytes, so we must still not be accounting for something correctly. Through inspection it turns out the above commit didn't take into account everything it should have. That is, it only accounts for a single log op_hdr structure, when it can actually require up to four op_hdrs - one for each region (log iovec) that is formatted. These regions are the inode log format header, the inode core, and the two forks that can be held in the literal area of the inode. This means we are not accounting for 36 bytes of log space that the transaction can use, and hence when we get inodes in certain formats with particular fragmentation patterns we can overrun the transaction. Fix this by adding the correct accounting for log op_headers in the transaction. Tested-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: xfs_check_page_type buffer checks need helpDave Chinner
xfs_aops_discard_page() was introduced in the following commit: xfs: truncate delalloc extents when IO fails in writeback ... to clean up left over delalloc ranges after I/O failure in ->writepage(). generic/224 tests for this scenario and occasionally reproduces panics on sub-4k blocksize filesystems. The cause of this is failure to clean up the delalloc range on a page where the first buffer does not match one of the expected states of xfs_check_page_type(). If a buffer is not unwritten, delayed or dirty&mapped, xfs_check_page_type() stops and immediately returns 0. The stress test of generic/224 creates a scenario where the first several buffers of a page with delayed buffers are mapped & uptodate and some subsequent buffer is delayed. If the ->writepage() happens to fail for this page, xfs_aops_discard_page() incorrectly skips the entire page. This then causes later failures either when direct IO maps the range and finds the stale delayed buffer, or we evict the inode and find that the inode still has a delayed block reservation accounted to it. We can easily fix this xfs_aops_discard_page() failure by making xfs_check_page_type() check all buffers, but this breaks xfs_convert_page() more than it is already broken. Indeed, xfs_convert_page() wants xfs_check_page_type() to tell it if the first buffers on the pages are of a type that can be aggregated into the contiguous IO that is already being built. xfs_convert_page() should not be writing random buffers out of a page, but the current behaviour will cause it to do so if there are buffers that don't match the current specification on the page. Hence for xfs_convert_page() we need to: a) return "not ok" if the first buffer on the page does not match the specification provided to we don't write anything; and b) abort it's buffer-add-to-io loop the moment we come across a buffer that does not match the specification. Hence we need to fix both xfs_check_page_type() and xfs_convert_page() to work correctly with pages that have mixed buffer types, whilst allowing xfs_aops_discard_page() to scan all buffers on the page for a type match. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: avoid AGI/AGF deadlock scenario for inode chunk allocationBrian Foster
The inode chunk allocation path can lead to deadlock conditions if a transaction is dirtied with an AGF (to fix up the freelist) for an AG that cannot satisfy the actual allocation request. This code path is written to try and avoid this scenario, but it can be reproduced by running xfstests generic/270 in a loop on a 512b fs. An example situation is: - process A attempts an inode allocation on AG 3, modifies the freelist, fails the allocation and ultimately moves on to AG 0 with the AG 3 AGF held - process B is doing a free space operation (i.e., truncate) and acquires the AG 0 AGF, waits on the AG 3 AGF - process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock) The problem here is that process A acquired the AG 3 AGF while moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF held). xfs_dialloc() makes one pass through each of the AGs when attempting to allocate an inode chunk. The expectation is a clean transaction if a particular AG cannot satisfy the allocation request. xfs_ialloc_ag_alloc() is written to support this through use of the minalignslop allocation args field. When using the agi->agi_newino optimization, we attempt an exact bno allocation request based on the location of the previously allocated chunk. minalignslop is set to inform the allocator that we will require alignment on this chunk, and thus to not allow the request for this AG if the extra space is not available. Suppose that the AG in question has just enough space for this request, but not at the requested bno. xfs_alloc_fix_freelist() will proceed as normal as it determines the request should succeed, and thus it is allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails because the requested bno is not available. In response, the caller moves on to a NEAR_BNO allocation request for the same AG. The alignment is set, but the minalignslop field is never reset. This increases the overall requirement of the request from the first attempt. If this delta is the difference between allocation success and failure for the AG, xfs_alloc_fix_freelist() rejects this request outright the second time around and causes the allocation request to unnecessarily fail for this AG. To address this situation, reset the minalignslop field immediately after use and prevent it from leaking into subsequent requests. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: use NOIO contexts for vm_map_ramDave Chinner
When we map pages in the buffer cache, we can do so in GFP_NOFS contexts. However, the vmap interfaces do not provide any method of communicating this information to memory reclaim, and hence we get lockdep complaining about it regularly and occassionally see hangs that may be vmap related reclaim deadlocks. We can also see these same problems from anywhere where we use vmalloc for a large buffer (e.g. attribute code) inside a transaction context. A typical lockdep report shows up as a reclaim state warning like so: [14046.101458] ================================= [14046.102850] [ INFO: inconsistent lock state ] [14046.102850] 3.14.0-rc4+ #2 Not tainted [14046.102850] --------------------------------- [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes: [14046.102850] (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a [14046.102850] {RECLAIM_FS-ON-W} state was registered at: [14046.102850] [<7904cdb1>] mark_held_locks+0x81/0xe7 [14046.102850] [<7904d390>] lockdep_trace_alloc+0x5c/0xb4 [14046.102850] [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e [14046.102850] [<790ba7f4>] vm_map_ram+0x119/0x3e6 [14046.102850] [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf [14046.102850] [<7914ed74>] xfs_buf_get_map+0x67/0x13f [14046.102850] [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5 [14046.102850] [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d [14046.102850] [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8 [14046.102850] [<7916eefc>] xfs_attr_set+0x6b/0x74 [14046.102850] [<79168355>] xfs_xattr_set+0x61/0x81 [14046.102850] [<790e5b10>] generic_setxattr+0x59/0x68 [14046.102850] [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce [14046.102850] [<790e4d0a>] vfs_setxattr+0x8e/0x92 [14046.102850] [<790e4ddd>] setxattr+0xcf/0x159 [14046.102850] [<790e5423>] SyS_lsetxattr+0x88/0xbb [14046.102850] [<79268438>] sysenter_do_call+0x12/0x36 Now, we can't completely remove these traces - mainly because vm_map_ram() will do GFP_KERNEL allocation and that generates the above warning before we get into the reclaim code, but we can turn them all into false positive warnings. To do that, use the method that DM and other IO context code uses to avoid this problem: there is a process flag to tell memory reclaim not to do IO that we can set appropriately. That prevents GFP_KERNEL context reclaim being done from deep inside the vmalloc code in places we can't directly pass a GFP_NOFS context to. That interface has a pair of wrapper functions: memalloc_noio_save() and memalloc_noio_restore(). Adding them around vm_map_ram and the vzalloc call in kmem_alloc_large() will prevent deadlocks and most lockdep reports for this issue. Also, convert the vzalloc() call in kmem_alloc_large() to use __vmalloc() so that we can pass the correct gfp context to the data page allocation routine inside __vmalloc() so that it is clear that GFP_NOFS context is important to this vmalloc call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: don't leak EFSBADCRC to userspaceDave Chinner
While the verifier routines may return EFSBADCRC when a buffer has a bad CRC, we need to translate that to EFSCORRUPTED so that the higher layers treat the error appropriately and we return a consistent error to userspace. This fixes a xfs/005 regression. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>