summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2015-01-22Merge branch 'xfs-buf-type-fixes' into for-nextDave Chinner
2015-01-22xfs: set superblock buffer type correctlyDave Chinner
When the superblock is modified in a transaction, the commonly modified fields are not actually copied to the superblock buffer to avoid the buffer lock becoming a serialisation point. However, there are some other operations that modify the superblock fields within the transaction that don't directly log to the superblock but rely on the changes to be applied during the transaction commit (to minimise the buffer lock hold time). When we do this, we fail to mark the buffer log item as being a superblock buffer and that can lead to the buffer not being marked with the corect type in the log and hence causing recovery issues. Fix it by setting the type correctly, similar to xfs_mod_sb()... cc: <stable@vger.kernel.org> # 3.10 to current Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: set buf types when converting extent formatsDave Chinner
Conversion from local to extent format does not set the buffer type correctly on the new extent buffer when a symlink data is moved out of line. Fix the symlink code and leave a comment in the generic bmap code reminding us that the format-specific data copy needs to set the destination buffer type appropriately. cc: <stable@vger.kernel.org> # 3.10 to current Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: inode unlink does not set AGI buffer typeDave Chinner
This leads to log recovery throwing errors like: XFS (md0): Mounting V5 Filesystem XFS (md0): Starting recovery (logdev: internal) XFS (md0): Unknown buffer type 0! XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1 ffff8800ffc53800: 58 41 47 49 ..... Which is the AGI buffer magic number. Ensure that we set the type appropriately in both unlink list addition and removal. cc: <stable@vger.kernel.org> # 3.10 to current Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: ensure buffer types are set correctlyDave Chinner
Jan Kara reported that log recovery was finding buffers with invalid types in them. This should not happen, and indicates a bug in the logging of buffers. To catch this, add asserts to the buffer formatting code to ensure that the buffer type is in range when the transaction is committed. We don't set a type on buffers being marked stale - they are not going to get replayed, the format item exists only for recovery to be able to prevent replay of the buffer, so the type does not matter. Hence that needs special casing here. cc: <stable@vger.kernel.org> # 3.10 to current Reported-by: Jan Kara <jack@suse.cz> Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22Merge branch 'xfs-sb-logging-rework' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_mount.c
2015-01-22xfs: sanitise sb_bad_features2 handlingDave Chinner
We currently have to ensure that every time we update sb_features2 that we update sb_bad_features2. Now that we log and format the superblock in it's entirety we actually don't have to care because we can simply update the sb_bad_features2 when we format it into the buffer. This removes the need for anything but the mount and superblock formatting code to care about sb_bad_features2, and hence removes the possibility that we forget to update bad_features2 when necessary in the future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: consolidate superblock logging functionsDave Chinner
We now have several superblock loggin functions that are identical except for the transaction reservation and whether it shoul dbe a synchronous transaction or not. Consolidate these all into a single function, a single reserveration and a sync flag and call it xfs_sync_sb(). Also, xfs_mod_sb() is not really a modification function - it's the operation of logging the superblock buffer. hence change the name of it to reflect this. Note that we have to change the mp->m_update_flags that are passed around at mount time to a boolean simply to indicate a superblock update is needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: remove bitfield based superblock updatesDave Chinner
When we log changes to the superblock, we first have to write them to the on-disk buffer, and then log that. Right now we have a complex bitfield based arrangement to only write the modified field to the buffer before we log it. This used to be necessary as a performance optimisation because we logged the superblock buffer in every extent or inode allocation or freeing, and so performance was extremely important. We haven't done this for years, however, ever since the lazy superblock counters pulled the superblock logging out of the transaction commit fast path. Hence we have a bunch of complexity that is not necessary that makes writing the in-core superblock to disk much more complex than it needs to be. We only need to log the superblock now during management operations (e.g. during mount, unmount or quota control operations) so it is not a performance critical path anymore. As such, remove the complex field based logging mechanism and replace it with a simple conversion function similar to what we use for all other on-disk structures. This means we always log the entirity of the superblock, but again because we rarely modify the superblock this is not an issue for log bandwidth or CPU time. Indeed, if we do log the superblock frequently, delayed logging will minimise the impact of this overhead. [Fixed gquota/pquota inode sharing regression noticed by bfoster.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09Merge branch 'xfs-misc-fixes-for-3.20-2' into for-nextDave Chinner
2015-01-09xfs: fix implicit bool to int conversionNicholas Mc Guire
try_wait_for_completion returns bool so the wrapper function xfs_dqflock_nowait should probably also return bool and not int. Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: pass a 64-bit count argument to xfs_iomap_write_unwrittenChristoph Hellwig
The code is already ready for it, and the pnfs layout commit code expects to be able to pass a larger than 32-bit argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: remove deprecated sysctlsDave Chinner
xfsbufd_centisecs and age_buffer_centisecs were due for removal in 3.14. We forgot to do that - it's now well past time to remove these deprecated, unused sysctls. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: move xfs_bmap_finish prototypeDave Chinner
This function is used libxfs code, but is implemented separately in userspace. Move the function prototype to xfs_bmap.h so that the prototype is shared even if the implementations aren't. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: move struct xfs_bmalloca to libxfsDave Chinner
It no long is used for stack splits, so strip the kernel workqueue bits from it and push it back into libxfs/xfs_bmap.h so that it can be shared with the userspace code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: move xfs_types.h to libxfsDave Chinner
The types used by the core XFS code are common between kernel and userspace. xfs_types.h is duplicated in both kernel and userspace, so move it to libxfs along with all the other shared code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09xfs: move xfs_fs.h to libxfsDave Chinner
Ioctl API definitions are shared with userspace, so move the header file that defines them all to libxfs along with all the other code shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24Merge branch 'xfs-misc-fixes-for-3.20-1' into for-nextDave Chinner
2014-12-24xfs: Keep sb_bad_features2 consistent with sb_features2Jan Kara
Currently when we modify sb_features2, we store the same value also in sb_bad_features2. However in most places we forget to mark field sb_bad_features2 for logging and thus it can happen that a change to it is lost. This results in an inconsistent sb_features2 and sb_bad_features2 fields e.g. after xfstests test xfs/187. Fix the problem by changing XFS_SB_FEATURES2 to actually mean both sb_features2 and sb_bad_features2 fields since this is always what we want to log. This isn't ideal because the fact that XFS_SB_FEATURES2 means two fields could cause some problem in future however the code is hopefully less error prone that it is now. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24xfs: remove extra newlines from xfs messagesEric Sandeen
xfs_warn() and friends add a newline by default, but some messages add another one. Particularly for the failing write message below, this can waste a lot of console real estate! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24xfs: initialize log buf I/O completion wq on log allocBrian Foster
Log buffer I/O completion passes through the high priority m_log_workqueue rather than the default metadata buffer workqueue. The log buffer wq is initialized at I/O submission time. The log buffers are reused once initialized, however, so this is not necessary. Initialize the log buffer I/O completion workqueue pointers once when the log is allocated and log buffers initialized rather than on every log buffer I/O submission. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24xfs: Add support to RENAME_EXCHANGE flagCarlos Maiolino
Adds a new function named xfs_cross_rename(), responsible for handling requests from sys_renameat2() using RENAME_EXCHANGE flag. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24xfs: Make xfs_vn_rename compliant with renameat2() syscallCarlos Maiolino
To be able to support RENAME_EXCHANGE flag from renameat2() system call, XFS must have its inode_operations updated, exporting .rename2 method, instead of .rename. This patch just replaces the (now old) .rename method by .rename2, using the same infra-structure, but checking rename flags. Calls to .rename2 using RENAME_EXCHANGE flag, although now handled inside XFS, still return -EINVAL. RENAME_NOREPLACE is handled via VFS and we don't need to care about it inside xfs_vn_rename. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #3 from Al Viro: "Assorted fixes and patches from the last cycle" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [regression] chunk lost from bd9b51 vfs: make mounts and mountstats honor root dir like mountinfo does vfs: cleanup show_mountinfo init: fix read-write root mount unfuck binfmt_misc.c (broken by commit e6084d4) vm_area_operations: kill ->migrate() new helper: iter_is_iovec() move_extent_per_page(): get rid of unused w_flags lustre: get rid of playing with ->fs btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
2014-12-19Merge tag 'ecryptfs-3.19-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull eCryptfs fixes from Tyler Hicks: "Fixes for filename decryption and encrypted view plus a cleanup - The filename decryption routines were, at times, writing a zero byte one character past the end of the filename buffer - The encrypted view feature attempted, and failed, to roll its own form of enforcing a read-only mount instead of letting the VFS enforce it" * tag 'ecryptfs-3.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: Remove buggy and unnecessary write in file name decode routine eCryptfs: Remove unnecessary casts when parsing packet lengths eCryptfs: Force RO mount when encrypted view is enabled
2014-12-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull more btrfs updates from Chris Mason: "This is part two of our merge window patches. These are all from Filipe, and fix some really hard to find races that can cause corruptions. Most of them involved block group removal (balance) or discard" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: remove non-sense btrfs_error_discard_extent() function Btrfs: fix fs corruption on transaction abort if device supports discard Btrfs: always clear a block group node when removing it from the tree Btrfs: ensure deletion from pinned_chunks list is protected
2014-12-19Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq core fix from Thomas Gleixner: "A single fix plugging a long standing race between proc/stat and proc/interrupts access and freeing of interrupt descriptors" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Prevent proc race against freeing of irq descriptors
2014-12-18Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc patches from Andrew Morton: "A few stragglers" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: tools/testing/selftests/Makefile: alphasort the TARGETS list mm/zsmalloc: adjust order of functions ocfs2: fix journal commit deadlock ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker ocfs2: reflink: fix slow unlink for refcounted file mm/memory.c:do_shared_fault(): add comment .mailmap: Santosh Shilimkar has moved .mailmap: update akpm@osdl.org lib/show_mem.c: add cma reserved information fs/proc/meminfo.c: include cma info in proc/meminfo mm: cma: split cma-reserved in dmesg log hfsplus: fix longname handling mm/mempolicy.c: remove unnecessary is_valid_nodemask()
2014-12-18ocfs2: fix journal commit deadlockJunxiao Bi
For buffer write, page lock will be got in write_begin and released in write_end, in ocfs2_write_end_nolock(), before it unlock the page in ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask for the read lock of journal->j_trans_barrier. Holding page lock and ask for journal->j_trans_barrier breaks the locking order. This will cause a deadlock with journal commit threads, ocfs2cmt will get write lock of journal->j_trans_barrier first, then it wakes up kjournald2 to do the commit work, at last it waits until done. To commit journal, kjournald2 needs flushing data first, it needs get the cache page lock. Since some ocfs2 cluster locks are holding by write process, this deadlock may hung the whole cluster. unlock pages before ocfs2_run_deallocs() can fix the locking order, also put unlock before ocfs2_commit_trans() to make page lock is unlocked before j_trans_barrier to preserve unlocking order. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: <stable@vger.kernel.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_workerJoseph Qi
Commit ac4fef4d23ed ("ocfs2/dlm: do not purge lockres that is queued for assert master") may have the following possible race case: dlm_dispatch_assert_master dlm_wq ======================================================================== queue_work(dlm->quedlm_worker, &dlm->dispatched_work); dispatch work, dlm_lockres_drop_inflight_worker *BUG_ON(res->inflight_assert_workers == 0)* dlm_lockres_grab_inflight_worker inflight_assert_workers++ So ensure inflight_assert_workers to be increased first. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Xue jiufei <xuejiufei@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18ocfs2: reflink: fix slow unlink for refcounted fileJunxiao Bi
When running ocfs2 test suite multiple nodes reflink stress test, for a 4 nodes cluster, every unlink() for refcounted file needs about 700s. The slow unlink is caused by the contention of refcount tree lock since all nodes are unlink files using the same refcount tree. When the unlinking file have many extents(over 1600 in our test), most of the extents has refcounted flag set. In ocfs2_commit_truncate(), it will execute the following call trace for every extents. This means it needs get and released refcount tree lock about 1600 times. And when several nodes are do this at the same time, the performance will be very low. ocfs2_remove_btree_range() -- ocfs2_lock_refcount_tree() ---- ocfs2_refcount_lock() ------ __ocfs2_cluster_lock() ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to do lock/unlock once can improve a lot performance. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Wengang <wen.gang.wang@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18fs/proc/meminfo.c: include cma info in proc/meminfoPintu Kumar
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo. Currently, in a CMA enabled system, if somebody wants to know the total CMA size declared, there is no way to tell, other than the dmesg or /var/log/messages logs. With this patch we are showing the CMA info as part of meminfo, so that it can be determined at any point of time. This will be populated only when CMA is enabled. Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB. MemTotal: 471172 kB MemFree: 111712 kB MemAvailable: 271172 kB . . . CmaTotal: 16384 kB CmaFree: 6144 kB This patch also fix below checkpatch errors that were found during these changes. ERROR: space required after that ',' (ctx:ExV) 199: FILE: fs/proc/meminfo.c:199: + ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10) ^ ERROR: space required after that ',' (ctx:ExV) 202: FILE: fs/proc/meminfo.c:202: + ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) * ^ ERROR: space required after that ',' (ctx:ExV) 206: FILE: fs/proc/meminfo.c:206: + ,K(totalcma_pages) ^ total: 3 errors, 0 warnings, 2 checks, 236 lines checked Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18hfsplus: fix longname handlingSougata Santra
Longname is not correctly handled by hfsplus driver. If an attempt to create a longname(>255) file/directory is made, it succeeds by creating a file/directory with HFSPLUS_MAX_STRLEN and incorrect catalog key. Thus leaving the volume in an inconsistent state. This patch fixes this issue. Although lookup is always called first to create a negative entry, so just doing a check in lookup would probably fix this issue. I choose to propagate error to other iops as well. Please NOTE: I have factored out hfsplus_cat_build_key_with_cnid from hfsplus_cat_build_key, to avoid unncessary branching. Thanks a lot. TEST: ------ dir="TEST_DIR" cdir=`pwd` name255="_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_1234" name256="${name255}5" mkdir $dir cd $dir touch $name255 rm -f $name255 touch $name256 ls -la cd $cdir rm -rf $dir RESULT: ------- [sougata@ultrabook tmp]$ cdir=`pwd` [sougata@ultrabook tmp]$ name255="_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_1234" [sougata@ultrabook tmp]$ name256="${name255}5" [sougata@ultrabook tmp]$ [sougata@ultrabook tmp]$ mkdir $dir [sougata@ultrabook tmp]$ cd $dir [sougata@ultrabook TEST_DIR]$ touch $name255 [sougata@ultrabook TEST_DIR]$ rm -f $name255 [sougata@ultrabook TEST_DIR]$ touch $name256 [sougata@ultrabook TEST_DIR]$ ls -la ls: cannot access _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234: No such file or directory total 0 drwxrwxr-x 1 sougata sougata 3 Feb 20 19:56 . drwxrwxrwx 1 root root 6 Feb 20 19:56 .. -????????? ? ? ? ? ? _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 [sougata@ultrabook TEST_DIR]$ cd $cdir [sougata@ultrabook tmp]$ rm -rf $dir rm: cannot remove `TEST_DIR': Directory not empty -ENAMETOOLONG returned from hfsplus_asc2uni was not propaged to iops. This allowed hfsplus to create files/directories with HFSPLUS_MAX_STRLEN and incorrect keys, leaving the FS in an inconsistent state. This patch fixes this issue. Signed-off-by: Sougata Santra <sougata@tuxera.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18mnt: Fix a memory stomp in umountEric W. Biederman
While reviewing the code of umount_tree I realized that when we append to a preexisting unmounted list we do not change pprev of the former first item in the list. Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on the former first item of the list will stomp unmounted.first leaving it set to some random mount point which we are likely to free soon. This isn't likely to hit, but if it does I don't know how anyone could track it down. [ This happened because we don't have all the same operations for hlist's as we do for normal doubly-linked lists. In particular, list_splice() is easy on our standard doubly-linked lists, while hlist_splice() doesn't exist and needs both start/end entries of the hlist. And commit 38129a13e6e7 incorrectly open-coded that missing hlist_splice(). We should think about making these kinds of "mindless" conversions easier to get right by adding the missing hlist helpers - Linus ] Fixes: 38129a13e6e71f666e0468e99fdd932a687b4d7e switch mnt_hash to hlist Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17Ceph: remove left-over reject fileLinus Torvalds
Neither Sage nor I noticed that Zheng Yan had mistakenly committed fs/ceph/super.h.rej as part of commit 31c542a199d7 ("ceph: add inline data to pagecache"). Remove it. Requested-by: Yan, Zheng <ukernel@gmail.com> Cc: Sage Weil <sweil@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "The big item here is support for inline data for CephFS and for message signatures from Zheng. There are also several bug fixes, including interrupted flock request handling, 0-length xattrs, mksnap, cached readdir results, and a message version compat field. Finally there are several cleanups from Ilya, Dan, and Markus. Note that there is another series coming soon that fixes some bugs in the RBD 'lingering' requests, but it isn't quite ready yet" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) ceph: fix setting empty extended attribute ceph: fix mksnap crash ceph: do_sync is never initialized libceph: fixup includes in pagelist.h ceph: support inline data feature ceph: flush inline version ceph: convert inline data to normal data before data write ceph: sync read inline data ceph: fetch inline data when getting Fcr cap refs ceph: use getattr request to fetch inline data ceph: add inline data to pagecache ceph: parse inline data in MClientReply and MClientCaps libceph: specify position of extent operation libceph: add CREATE osd operation support libceph: add SETXATTR/CMPXATTR osd operations support rbd: don't treat CEPH_OSD_OP_DELETE as extent op ceph: remove unused stringification macros libceph: require cephx message signature by default ceph: introduce global empty snap context ceph: message versioning fixes ...
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace related fixes from Eric Biederman: "As these are bug fixes almost all of thes changes are marked for backporting to stable. The first change (implicitly adding MNT_NODEV on remount) addresses a regression that was created when security issues with unprivileged remount were closed. I go on to update the remount test to make it easy to detect if this issue reoccurs. Then there are a handful of mount and umount related fixes. Then half of the changes deal with the a recently discovered design bug in the permission checks of gid_map. Unix since the beginning has allowed setting group permissions on files to less than the user and other permissions (aka ---rwx---rwx). As the unix permission checks stop as soon as a group matches, and setgroups allows setting groups that can not later be dropped, results in a situtation where it is possible to legitimately use a group to assign fewer privileges to a process. Which means dropping a group can increase a processes privileges. The fix I have adopted is that gid_map is now no longer writable without privilege unless the new file /proc/self/setgroups has been set to permanently disable setgroups. The bulk of user namespace using applications even the applications using applications using user namespaces without privilege remain unaffected by this change. Unfortunately this ix breaks a couple user space applications, that were relying on the problematic behavior (one of which was tools/selftests/mount/unprivileged-remount-test.c). To hopefully prevent needing a regression fix on top of my security fix I rounded folks who work with the container implementations mostly like to be affected and encouraged them to test the changes. > So far nothing broke on my libvirt-lxc test bed. :-) > Tested with openSUSE 13.2 and libvirt 1.2.9. > Tested-by: Richard Weinberger <richard@nod.at> > Tested on Fedora20 with libvirt 1.2.11, works fine. > Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com> > Ok, thanks - yes, unprivileged lxc is working fine with your kernels. > Just to be sure I was testing the right thing I also tested using > my unprivileged nsexec testcases, and they failed on setgroup/setgid > as now expected, and succeeded there without your patches. > Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com> > I tested this with Sandstorm. It breaks as is and it works if I add > the setgroups thing. > Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :(" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Unbreak the unprivileged remount tests userns; Correct the comment in map_write userns: Allow setting gid_maps without privilege when setgroups is disabled userns: Add a knob to disable setgroups on a per user namespace basis userns: Rename id_map_mutex to userns_state_mutex userns: Only allow the creator of the userns unprivileged mappings userns: Check euid no fsuid when establishing an unprivileged uid mapping userns: Don't allow unprivileged creation of gid mappings userns: Don't allow setgroups until a gid mapping has been setablished userns: Document what the invariant required for safe unprivileged mappings. groups: Consolidate the setgroups permission checks mnt: Clear mnt_expire during pivot_root mnt: Carefully set CL_UNPRIVILEGED in clone_mnt mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. umount: Do not allow unmounting rootfs. umount: Disallow unprivileged mount force mnt: Update unprivileged remount test mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
2014-12-17Merge tag 'for-linus-20141215' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull MTD updates from Brian Norris: "Summary: - Add device tree support for DoC3 - SPI NOR: Refactoring, for better layering between spi-nor.c and its driver users (e.g., m25p80.c) New flash device support Support 6-byte ID strings - NAND: New NAND driver for Allwinner SoC's (sunxi) GPMI NAND: add support for raw (no ECC) access, for testing purposes Add ATO manufacturer ID A few odd driver fixes - MTD tests: Allow testers to compensate for OOB bitflips in oobtest Fix a torturetest regression - nandsim: Support longer ID byte strings And more" * tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd: (63 commits) mtd: tests: abort torturetest on erase errors mtd: physmap_of: fix potential NULL dereference mtd: spi-nor: allow NULL as chip name and try to auto detect it mtd: nand: gpmi: add raw oob access functions mtd: nand: gpmi: add proper raw access support mtd: nand: gpmi: add gpmi_copy_bits function mtd: spi-nor: factor out write_enable() for erase commands mtd: spi-nor: add support for s25fl128s mtd: spi-nor: remove the jedec_id/ext_id mtd: spi-nor: add id/id_len for flash_info{} mtd: nand: correct the comment of function nand_block_isreserved() jffs2: Drop bogus if in comment mtd: atmel_nand: replace memcpy32_toio/memcpy32_fromio with memcpy mtd: cafe_nand: drop duplicate .write_page implementation mtd: m25p80: Add support for serial flash Spansion S25FL132K MTD: m25p80: fix inconsistency in m25p_ids compared to spi_nor_ids mtd: spi-nor: improve wait-till-ready timeout loop mtd: delete unnecessary checks before two function calls mtd: nand: omap: Fix NAND enumeration on 3430 LDP mtd: nand: add ATO manufacturer info ...
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "The first part makes sure we don't hold up umount with pending async requests. In addition to being a cleanup, this is a small behavioral change (for the better) and unlikely to break anything. The second part prepares for a cleanup of the fuse device I/O code by adding a helper for simple request submission, with some savings in line numbers already realized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use file_inode() in fuse_file_fallocate() fuse: introduce fuse_simple_request() helper fuse: reduce max out args fuse: hold inode instead of path after release fuse: flush requests on umount fuse: don't wake up reserved req in fuse_conn_kill()
2014-12-17ceph: fix setting empty extended attributeYan, Zheng
make sure 'value' is not null. otherwise __ceph_setxattr will remove the extended attribute. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17ceph: fix mksnap crashYan, Zheng
mksnap reply only contain 'target', does not contain 'dentry'. So it's wrong to use req->r_reply_info.head->is_dentry to detect traceless reply. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17ceph: do_sync is never initializedDan Carpenter
Probably this code was syncing a lot more often then intended because the do_sync variable wasn't set to zero. Cc: stable@vger.kernel.org # v3.11+ Fixes: c62988ec0910 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: support inline data featureYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: flush inline versionYan, Zheng
After converting inline data to normal data, client need to flush the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes cap messages (sent to MDS) contain inline_version and inline_data. Client always converts inline data to normal data before data write, so the inline data length part is always zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: convert inline data to normal data before data writeYan, Zheng
Before any data write, convert inline data to normal data and set i_inline_version to CEPH_INLINE_NONE. The OSD request that saves inline data to object contains 3 operations (CMPXATTR, WRITE and SETXATTR). It compares a xattr named 'inline_version' to prevent old data overwrites newer data. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: sync read inline dataYan, Zheng
we can't use getattr to fetch inline data while holding Fr cap, because it can cause deadlock. If we need to sync read inline data, drop cap refs first, then use getattr to fetch inline data. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: fetch inline data when getting Fcr cap refsYan, Zheng
we can't use getattr to fetch inline data after getting Fcr caps, because it can cause deadlock. The solution is try bringing inline data to page cache when not holding any cap, and hope the inline data page is still there after getting the Fcr caps. If the page is still there, pin it in page cache for later IO. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: use getattr request to fetch inline dataYan, Zheng
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data in getattr reply will be copied to the page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: add inline data to pagecacheYan, Zheng
Request reply and cap message can contain inline data. add inline data to the page cache if there is Fc cap. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: parse inline data in MClientReply and MClientCapsYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com>