path: root/fs
Age | Commit message | Author
2014-11-12 | btrfs: do commit in sync_fs if there are pending changes | David Sterba
A requested pending change is not processed until a transaction commit happens, not even after a sync or the SYNC_FS ioctl. For example, a remount that toggles the inode_cache option will not take effect after a sync on a quiescent filesystem. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 | btrfs: add support for processing pending changes | David Sterba
Some actions modify global filesystem state but cannot be performed at the time they are requested; they must be deferred to transaction commit time, when the filesystem is in a known state. Examples are enabling new incompat features on the fly and issuing a transaction commit from unsafe contexts (sysfs handlers). Signed-off-by: David Sterba <dsterba@suse.cz>
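The deferral described in these two btrfs entries can be sketched in userspace C: requesters only set bits in an atomic pending mask, the commit path applies and clears them in one step, and a sync_fs-style hook forces a commit whenever something is pending. All names below (`request_change`, `commit_transaction`, `sync_fs_sketch`, the `PENDING_*` bits) are hypothetical and are not btrfs's actual interfaces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pending-change bits; not the real btrfs flag names. */
enum {
    PENDING_SET_INODE_CACHE   = 1u << 0,
    PENDING_CLEAR_INODE_CACHE = 1u << 1,
};

struct fs_state {
    atomic_uint pending;  /* changes requested but not yet applied */
    bool inode_cache;     /* global state, modified only at commit time */
    unsigned commits;     /* number of transaction commits performed */
};

/* Called from an "unsafe" context (remount, sysfs handler): only record
 * the request, never touch the global state directly. */
static void request_change(struct fs_state *fs, unsigned bit)
{
    assert(bit != 0);
    atomic_fetch_or(&fs->pending, bit);
}

/* Commit time: the filesystem is in a known state, so apply every
 * pending change and clear the mask in one atomic step. */
static void commit_transaction(struct fs_state *fs)
{
    unsigned bits = atomic_exchange(&fs->pending, 0u);

    if (bits & PENDING_SET_INODE_CACHE)
        fs->inode_cache = true;
    if (bits & PENDING_CLEAR_INODE_CACHE)
        fs->inode_cache = false;
    fs->commits++;
}

/* sync_fs-style hook: on a quiescent filesystem nothing else would start
 * a commit, so force one whenever changes are pending. */
static void sync_fs_sketch(struct fs_state *fs)
{
    if (atomic_load(&fs->pending) != 0)
        commit_transaction(fs);
}
```

Without the check in `sync_fs_sketch`, the requested change would sit in the mask until some unrelated commit happened, which is exactly the remount/inode_cache symptom described in the first entry.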
2014-11-11 | efivarfs: Allow unloading when build as module | Mathias Krause
There is no need to keep the module loaded when it serves no function in case the EFI runtime services are disabled. Return an error in this case so loading the module will fail. Also supply a module_exit function to allow unloading the module. Last, but not least, set the owner of the file_system_type struct. Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-11-11 | f2fs: convert inline_data when i_size becomes large | Jaegeuk Kim
If i_size grows beyond MAX_INLINE_DATA, we should convert the inode. Otherwise, dirty pages may be created during truncation, and those pages will be written through f2fs_write_data_page. At that point the inode still has inline_data, so it tries to write non-zero pages into the inline_data area. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-11 | f2fs: fix deadlock to grab 0'th data page | Jaegeuk Kim
The scenario is as follows. One thread triggers:
  f2fs_write_data_pages
  lock_page
  f2fs_write_data_page
  f2fs_lock_op <- wait
The other thread triggers:
  f2fs_truncate
  truncate_blocks
  f2fs_lock_op
  truncate_partial_data_page
  lock_page <- wait for locking the page
This patch resolves the bug by relocating truncate_partial_data_page. This function only truncates the user data page and is not related to FS consistency. Also, we don't need to call truncate_inline_data; instead, f2fs_write_data_page will update inline_data later. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-10 | f2fs: reduce the number of inline_data inode before clearing it | Jaegeuk Kim
The count of inline_data inodes should be decreased only when the inode has inline_data. Once the flag has been cleared, we can no longer decrease the count, so it must be decreased before clearing the flag. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-10 | f2fs: implement -o dirsync | Jaegeuk Kim
When the filesystem is mounted with dirsync, we should trigger a checkpoint for every directory operation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-10 | f2fs: do not skip any writes under memory pressure | Jaegeuk Kim
Under memory pressure, let's avoid skipping data writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-10 | f2fs: write node pages if checkpoint is not doing | Jaegeuk Kim
Node pages should be written when no checkpoint is in progress, in order to avoid memory pressure. Reviewed-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-10 | vfs: Remove i_dquot field from inode | Jan Kara
All filesystems using VFS quotas are now converted to use their private i_dquot fields. Remove the i_dquot field from generic inode structure. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | jfs: Convert to private i_dquot field | Jan Kara
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> CC: jfs-discussion@lists.sourceforge.net Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | reiserfs: Convert to private i_dquot field | Jan Kara
CC: reiserfs-devel@vger.kernel.org CC: Jeff Mahoney <jeffm@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | ocfs2: Convert to private i_dquot field | Jan Kara
CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | ext4: Convert to private i_dquot field | Jan Kara
CC: linux-ext4@vger.kernel.org Acked-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | ext3: Convert to private i_dquot field | Jan Kara
CC: linux-ext4@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | ext2: Convert to private i_dquot field | Jan Kara
CC: linux-ext4@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | quota: Use function to provide i_dquot pointers | Jan Kara
The i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs, reiserfs), so it is beneficial to move this array into the fs-private part of the inode. We cannot simply pass quota pointers from filesystems to quota functions, because during quotaon and quotaoff we have to traverse the list of all inodes and manipulate the i_dquot pointers of each inode. So we provide a function that generic quota code can use to get a pointer to the i_dquot array from the filesystem. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
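A minimal userspace sketch of the indirection described above, assuming a simplified model: the generic inode carries no quota pointers, each filesystem embeds them in its private inode, and generic code reaches them through a per-filesystem callback while walking all inodes. The names `sb_ops`, `get_dquots`, `myfs_inode` and `drop_dquot_refs` are illustrative, and `container_of` is defined locally to mirror the kernel idiom.

```c
#include <assert.h>
#include <stddef.h>

#define MAXQUOTAS_SKETCH 3  /* user, group, project: an assumption for the sketch */

struct dquot { int id; };
struct vfs_inode;

/* Per-filesystem operations table (stand-in for the superblock ops). */
struct sb_ops {
    /* Returns the filesystem's private i_dquot array for this inode. */
    struct dquot **(*get_dquots)(struct vfs_inode *inode);
};

/* Generic inode: it no longer carries an i_dquot[] array itself. */
struct vfs_inode {
    const struct sb_ops *ops;
};

/* Filesystem-private inode embedding the generic one. */
struct myfs_inode {
    struct dquot *i_dquot[MAXQUOTAS_SKETCH];
    struct vfs_inode vfs_inode;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct dquot **myfs_get_dquots(struct vfs_inode *inode)
{
    return container_of(inode, struct myfs_inode, vfs_inode)->i_dquot;
}

static const struct sb_ops myfs_ops = { .get_dquots = myfs_get_dquots };

/* Generic quota code (e.g. at quotaoff): walk every inode and drop the
 * pointer for one quota type, without knowing where the fs keeps it. */
static void drop_dquot_refs(struct vfs_inode **inodes, size_t n, int type)
{
    assert(type >= 0 && type < MAXQUOTAS_SKETCH);
    for (size_t i = 0; i < n; i++) {
        struct dquot **dq = inodes[i]->ops->get_dquots(inodes[i]);
        dq[type] = NULL;
    }
}
```

Filesystems that do not use VFS quotas simply never embed the array, so the generic inode shrinks for everyone.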
2014-11-10 | xfs: Set allowed quota types | Jan Kara
We support user, group, and project quotas. Tell VFS about it. CC: xfs@oss.sgi.com CC: Dave Chinner <david@fromorbit.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | gfs2: Set allowed quota types | Jan Kara
We support user and group quotas. Tell vfs about it. Acked-by: Steven Whitehouse <swhiteho@redhat.com> CC: cluster-devel@redhat.com Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 | quota: Allow each filesystem to specify which quota types it supports | Jan Kara
Currently, all filesystems supporting VFS quota support user and group quotas. With the introduction of project quotas this is going to change, so introduce a bitmask describing which quota types each filesystem supports, and use it to make sure a filesystem is never called for a quota type it doesn't support. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
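The per-filesystem bitmask can be pictured as follows. The quota type constants are named after the kernel's `USRQUOTA`/`GRPQUOTA`/`PRJQUOTA`; `QTYPE_MASK`, `sb_sketch`, `check_quota_type` and the choice of `-EINVAL` for an unsupported type are assumptions for illustration. The two example masks correspond to the gfs2 (user and group) and xfs (user, group and project) entries above.

```c
#include <assert.h>
#include <errno.h>

enum quota_type { USRQUOTA = 0, GRPQUOTA = 1, PRJQUOTA = 2 };

#define QTYPE_MASK(t) (1u << (t))

struct sb_sketch {
    unsigned quota_types;  /* bitmask of quota types this fs supports */
};

/* What a gfs2-like and an xfs-like filesystem would advertise. */
static const struct sb_sketch gfs2_like = {
    QTYPE_MASK(USRQUOTA) | QTYPE_MASK(GRPQUOTA)
};
static const struct sb_sketch xfs_like = {
    QTYPE_MASK(USRQUOTA) | QTYPE_MASK(GRPQUOTA) | QTYPE_MASK(PRJQUOTA)
};

/* Generic code checks the mask before calling into the filesystem. */
static int check_quota_type(const struct sb_sketch *sb, enum quota_type type)
{
    assert((unsigned)type <= PRJQUOTA);
    if (!(sb->quota_types & QTYPE_MASK(type)))
        return -EINVAL;
    return 0;
}
```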
2014-11-10 | quota: Remove const from function declarations | Jan Kara
We don't use const through VFS too much so just remove it from quota function declarations. Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-09 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
Pull btrfs fix from Chris Mason: "It's a one liner for an error cleanup path that leads to crashes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
2014-11-07 | Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs | Linus Torvalds
Pull xfs fixes from Dave Chinner: "This update fixes a warning in the new pagecache_isize_extended() and updates some related comments, and also includes another fix for zero-range misbehaviour and an unfortunately large set of fixes for regressions in the bulkstat code.
The bulkstat fixes are large but necessary. I wouldn't normally push such a rework for a -rcX update, but right now xfsdump can silently create incomplete dumps on 3.17 and it's possible that even xfsrestore won't notice that the dumps were incomplete. Hence we need to get this update into 3.17-stable kernels ASAP.
In more detail, the refactoring work I committed in 3.17 has exposed a major hole in our QA coverage. With both xfsdump (the major user of bulkstat) and xfsrestore silently ignoring missing files in the dump/restore process, incomplete dumps were going unnoticed if they were being triggered. Many of the dump/restore filesets were so small that they didn't even have a chance of triggering the loop iteration bugs we introduced in 3.17, so we didn't exercise the code sufficiently, either.
We have already taken steps to improve QA coverage in xfstests to avoid this happening again, and I've done a lot of manual verification of dump/restore on very large data sets (tens of millions of inodes) over the past week to verify this patch set results in bulkstat behaving the same way as it does on 3.16.
Unfortunately, the fixes are not exactly simple: in tracking down the problem, historic API warts were discovered (e.g. xfsdump has been working around a 20 year old bug in the bulkstat API for the past 10 years), and that complicated the process of diagnosing and fixing the problems. I.e. we had to fix bugs in the code as well as discover and re-introduce the userspace-visible API bugs that we unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.
Summary:
 - incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking
 - another zero-range bug fix for stray file size updates
 - a bunch of fixes for regression in the bulkstat code introduced in 3.17"
* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: track bulkstat progress by agino
  xfs: bulkstat error handling is broken
  xfs: bulkstat main loop logic is a mess
  xfs: bulkstat chunk-formatter has issues
  xfs: bulkstat chunk formatting cursor is broken
  xfs: bulkstat btree walk doesn't terminate
  mm: Fix comment before truncate_setsize()
  xfs: rework zero range to prevent invalid i_size updates
  mm: Remove false WARN_ON from pagecache_isize_extended()
  xfs: Check error during inode btree iteration in xfs_bulkstat()
  xfs: bulkstat doesn't release AGI buffer on error
2014-11-07 | nfsd: convert nfs4_file searches to use RCU | Jeff Layton
The global state_lock protects the file_hashtbl, and that has the potential to be a scalability bottleneck. Address this by making the file_hashtbl use RCU. Add an rcu_head to the nfs4_file and use it when freeing entries that have been hashed. To conserve space, we union the fi_rcu field with the fi_delegations list_head, which must be clear by the time the last reference to the file is dropped. Convert find_file_locked to use RCU lookup primitives so that it no longer requires the state_lock to be held, and convert find_file to do a lockless lookup. Convert find_or_add_file to attempt a lockless lookup first, and then fall back to a locked search and insert if that fails to find anything. Also, minimize the number of times we need to calculate the hash value by passing it in as an argument to the search and insert functions, and optimize the order of arguments in nfsd4_init_file. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
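The "lockless lookup first, locked search-and-insert as fallback" pattern can be sketched in portable C11. This is not RCU: entries are published with release stores, read with acquire loads, and never freed, which is what makes the unlocked reads safe here; the kernel version relies on RCU to allow freeing. The spinlock stands in for `state_lock`, and all names (`file_ent`, `find_or_add`, `hash_key`, the bucket count) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NBUCKETS 256  /* hypothetical table size */

struct file_ent {
    unsigned key;                   /* stands in for the file handle */
    struct file_ent *_Atomic next;
};

static struct file_ent *_Atomic table[NBUCKETS];
static atomic_flag table_lock = ATOMIC_FLAG_INIT;

static unsigned hash_key(unsigned key)
{
    return (key * 2654435761u) % NBUCKETS;
}

/* Lockless lookup: safe only because entries are never removed here. */
static struct file_ent *find_lockless(unsigned key, unsigned h)
{
    for (struct file_ent *f = atomic_load_explicit(&table[h], memory_order_acquire);
         f != NULL;
         f = atomic_load_explicit(&f->next, memory_order_acquire))
        if (f->key == key)
            return f;
    return NULL;
}

static struct file_ent *find_or_add(unsigned key)
{
    unsigned h = hash_key(key);            /* hash computed once, passed down */
    struct file_ent *f = find_lockless(key, h);
    if (f)
        return f;

    while (atomic_flag_test_and_set_explicit(&table_lock, memory_order_acquire))
        ;                                  /* spin: stand-in for state_lock */
    f = find_lockless(key, h);             /* re-check: someone may have raced us */
    if (!f) {
        f = malloc(sizeof(*f));
        if (f) {
            f->key = key;
            atomic_init(&f->next,
                        atomic_load_explicit(&table[h], memory_order_relaxed));
            atomic_store_explicit(&table[h], f, memory_order_release);
        }
    }
    atomic_flag_clear_explicit(&table_lock, memory_order_release);
    return f;
}
```

The second lookup under the lock is what keeps two racing callers from inserting duplicate entries for the same key.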
2014-11-07 | nfsd: Add DEALLOCATE support | Anna Schumaker
DEALLOCATE only returns a status value, meaning we can use the noop() xdr encoder to reply to the client. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-07 | nfsd: Add ALLOCATE support | Anna Schumaker
The ALLOCATE operation is used to preallocate space in a file. This can be done by using vfs_fallocate() to perform the actual preallocation. ALLOCATE only returns a status indicator, so we don't need to write a special encode() function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-07 | VFS: Rename do_fallocate() to vfs_fallocate() | Anna Schumaker
This function needs to be exported so it can be used by the NFSD module when responding to the new ALLOCATE and DEALLOCATE operations in NFS v4.2. Christoph Hellwig suggested renaming the function to stay consistent with how other vfs functions are named. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-07 | sysfs/kernfs: make read requests on pre-alloc files use the buffer. | NeilBrown
To match the previous patch, which used the pre-alloc buffer for writes, this patch makes reads use the same buffer. This is not strictly necessary, as the current seq_read() will allocate on first read, so user-space can trigger the required pre-alloc. But consistency is valuable. The read function is somewhat simpler than seq_read() and, for example, does not support reading from an offset into the file: reads must be at the start of the file. As seq_read() does not use the prealloc buffer, ->seq_show is incompatible with ->prealloc and causes an EINVAL return from open(). sysfs code which calls into kernfs always chooses the correct function. As the buffer is shared with writes and other reads, the mutex is extended to cover the copy_to_user. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-07 | sysfs/kernfs: allow attributes to request write buffer be pre-allocated. | NeilBrown
md/raid allows metadata management to be performed in user-space. At various times, particularly on device failure, the metadata needs to be updated before further writes can be permitted. This means that the user-space program which updates metadata must not block on writeout, and so must not allocate memory. mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all memory allocation issues for user-memory, but that does not help kernel memory. Several kernel objects can be pre-allocated, e.g. files opened before any writes to the array are permitted. However, some kernel allocation happens in places that cannot be pre-allocated. In particular, writes to sysfs files (to tell md that it can now allow writes to the array) allocate a buffer using GFP_KERNEL. This patch allows attributes to be marked as "PREALLOC". In that case the maximal buffer is allocated when the file is opened, and then used on each write instead of allocating a new buffer. As the same buffer is now shared for all writes on the same file description, the mutex is extended to cover full use of the buffer including the copy_from_user(). The new __ATTR_PREALLOC() ORs a new flag into the 'mode', which is inspected by sysfs_add_file_mode_ns() to determine if the file should be marked as requiring prealloc. Despite the comment, we *do* use ->seq_show together with ->prealloc in this patch. The next patch fixes that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
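A single-threaded sketch of the PREALLOC idea, assuming a simplified open-file structure: when prealloc is requested the maximal buffer is allocated at open time, and each write reuses it instead of allocating. The counter `write_allocs` exists only to make that difference observable; `open_attr`, `attr_open`, `attr_write` and `ATTR_MAX` are hypothetical names, and the per-open mutex described above is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define ATTR_MAX 4096  /* assumed maximal attribute size */

static char stored[ATTR_MAX + 1];  /* stands in for the attribute's store hook */
static unsigned write_allocs;      /* allocations made on the write path */

struct open_attr {
    char *prealloc_buf;  /* allocated once at open() if prealloc was requested */
    /* A per-open mutex covering all use of this buffer (including the copy
     * from user space) is omitted in this single-threaded sketch. */
};

static int attr_open(struct open_attr *of, int prealloc)
{
    of->prealloc_buf = prealloc ? malloc(ATTR_MAX + 1) : NULL;
    return (prealloc && !of->prealloc_buf) ? -ENOMEM : 0;
}

static void attr_close(struct open_attr *of)
{
    free(of->prealloc_buf);
    of->prealloc_buf = NULL;
}

static ssize_t attr_write(struct open_attr *of, const char *ubuf, size_t len)
{
    char *buf = of->prealloc_buf;

    if (len > ATTR_MAX)
        len = ATTR_MAX;
    if (!buf) {                 /* non-prealloc path: allocate on every write */
        buf = malloc(len + 1);
        if (!buf)
            return -ENOMEM;
        write_allocs++;
    }
    memcpy(buf, ubuf, len);     /* stands in for copy_from_user() */
    buf[len] = '\0';
    memcpy(stored, buf, len + 1);
    if (buf != of->prealloc_buf)
        free(buf);
    return (ssize_t)len;
}
```

With prealloc, a failing allocation can only happen at open time, before the array is blocked, which is the property the md user-space tool needs.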
2014-11-07 | fs: sysfs: return EGBIG on write if offset is larger than file size | Vladimir Zapolskiy
Users expect common utilities such as dd or the shell redirection operator > to work correctly on binary sysfs files. At the moment an excessive write can never complete:
  write(1, "\0\0\0\0\0\0\0\0", 8) = 4
  write(1, "\0\0\0\0", 4) = 0
  write(1, "\0\0\0\0", 4) = 0
  write(1, "\0\0\0\0", 4) = 0
  ...
Fix the problem by returning EFBIG, as described in man 2 write. Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
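The fixed behaviour can be modeled with a small write helper for a fixed-size binary attribute; `bin_attr_write` is a hypothetical name and this is not the sysfs source. Matching the trace above, the first 8-byte write is clamped to 4 bytes, and the next write at offset 4 fails with `-EFBIG` instead of returning 0 forever.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Write into a fixed-size binary attribute backed by buf[0..size). */
static ssize_t bin_attr_write(char *buf, size_t size, const char *src,
                              size_t count, size_t off)
{
    if (off >= size)
        return -EFBIG;          /* nothing can ever be written here */
    if (count > size - off)
        count = size - off;     /* short write up to the end */
    memcpy(buf + off, src, count);
    return (ssize_t)count;
}
```

Returning 0 tells a caller like dd "try again", so it loops; an error ends the loop with a clear diagnosis.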
2014-11-07 | UBIFS: fix a couple bugs in UBIFS xattr length calculation | Subodh Nijsure
The journal update function did not work for extended attributes properly, because extended attribute inodes carry the xattr data, and the size of this data was not taken into account. Artem: improved commit message, amended the patch a bit. Signed-off-by: Subodh Nijsure <snijsure@grid-net.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Shelton <ben.shelton@ni.com> Acked-by: Brad Mouring <brad.mouring@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-11-07 | UBIFS: fix budget leak in error path | Artem Bityutskiy
We forgot to free the budget in 'write_begin_slow()' when 'do_readpage()' fails. This patch fixes the issue. Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
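The shape of this fix is the usual goto-based error unwinding: every resource acquired before a failure point must be released on the failure path. The sketch below assumes a trivial counter in place of UBIFS budgeting; `budget_space`, `release_budget`, `read_page` and `write_begin_slow_sketch` are hypothetical stand-ins, not UBIFS functions.

```c
#include <assert.h>
#include <errno.h>

struct budget { int reserved; };  /* hypothetical stand-in for UBIFS budgeting */

static int budget_space(struct budget *b)
{
    b->reserved++;
    return 0;
}

static void release_budget(struct budget *b)
{
    assert(b->reserved > 0);
    b->reserved--;
}

/* Hypothetical stand-in for do_readpage(); fails when asked to. */
static int read_page(int fail)
{
    return fail ? -EIO : 0;
}

static int write_begin_slow_sketch(struct budget *b, int fail_read)
{
    int err = budget_space(b);
    if (err)
        return err;

    err = read_page(fail_read);
    if (err)
        goto out_release;   /* the leak: this path used to return directly */

    return 0;               /* on success the budget is kept for the write */

out_release:
    release_budget(b);
    return err;
}
```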
2014-11-06 | f2fs: control the memory footprint used by ino entries | Jaegeuk Kim
This patch adds control over the memory footprint used by ino entries. The limit is enforced on a best-effort basis, not strictly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-06 | f2fs: introduce the number of inode entries | Jaegeuk Kim
This patch adds a counter to monitor the number of ino entries. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-07 | xfs: track bulkstat progress by agino | Dave Chinner
The bulkstat main loop progress is tracked by the "lastino" variable, which is a full 64 bit inode. However, the loop actually works on agno/agino pairs, and so there's a significant disconnect between the rest of the loop and the main cursor. Convert this to use the agino, and pass the agino into the chunk formatting function and convert it too. This gets rid of the inconsistency in the loop processing, and finally makes it simple for us to skip inodes at any point in the loop simply by incrementing the agino cursor. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
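For readers unfamiliar with XFS inode numbers: an inode number packs the AG number above the AG-relative inode number (agino), with the number of agino bits derived from the superblock geometry. The helpers below are a simplified sketch of that packing, not the kernel's macros, and take the agino bit count as a parameter. With an agino cursor, skipping an inode is just `agino + 1` within the same AG.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t agino_to_ino(uint32_t agno, uint32_t agino, unsigned agino_log)
{
    /* agino must fit in the low agino_log bits */
    assert(agino_log < 32 && (agino >> agino_log) == 0);
    return ((uint64_t)agno << agino_log) | agino;
}

static uint32_t ino_to_agno(uint64_t ino, unsigned agino_log)
{
    return (uint32_t)(ino >> agino_log);
}

static uint32_t ino_to_agino(uint64_t ino, unsigned agino_log)
{
    return (uint32_t)(ino & ((1ull << agino_log) - 1));
}
```

Tracking `(agno, agino)` directly avoids converting back and forth through the 64-bit number on every iteration, which is the inconsistency the entry describes.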
2014-11-07 | xfs: bulkstat error handling is broken | Dave Chinner
The error propagation is a horror: xfs_bulkstat() returns an rval variable which is only set if there are formatter errors. Any sort of btree walk error or corruption will cause the bulkstat walk to terminate but will not pass an error back to userspace. Worse, formatter errors are also ignored if any inodes were correctly formatted into the user buffer. Hence bulkstat can fail badly yet still report success to userspace. This causes significant issues with xfsdump not dumping everything in the filesystem yet reporting success. It's not until a restore fails that there is any indication that the dump was bad and that bulkstat failed. With this patch, xfsdump fails on bulkstat errors rather than silently missing files in the dump. This now causes bulkstat to fail when the lastino cookie does not fall inside an existing inode chunk. The pre-3.17 code tolerated that error by allowing the code to move to the next inode chunk, as the agino target is guaranteed to fall into the next btree record. With the fixes up to this point in the series, xfsdump now passes on the troublesome filesystem image that exposes all these bugs. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2014-11-07 | xfs: bulkstat main loop logic is a mess | Dave Chinner
There are a bunch of variables that are more widely scoped than they need to be, obfuscated user buffer checks, and tortured "next inode" tracking. This all needs cleaning up to expose the real issues that need fixing. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07 | xfs: bulkstat chunk-formatter has issues | Dave Chinner
The loop construct has issues:
 - clustidx is completely unused, so remove it.
 - the loop tries to be smart by terminating when the "freecount" tells it that all inodes are free. Just drop it, as in most cases we have to scan all inodes in the chunk anyway.
 - move the "user buffer left" condition check to the only point where we consume space in the user buffer.
 - move the initialisation of agino out of the loop, leaving just a simple loop control logic using the clusteridx.
Also, double handling of the user buffer variables leads to problems tracking the current state: use the cursor variables directly rather than keeping local copies and then having to update the cursor before returning. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-07 | xfs: bulkstat chunk formatting cursor is broken | Dave Chinner
The xfs_bulkstat_agichunk formatting cursor takes buffer values from the main loop and passes them via the structure to the chunk formatter, and then writes the changed values back into the main loop's local variables. Unfortunately, this complex dance is full of corner cases that aren't handled correctly. The biggest problem is that the information is double-handled in both the main loop and the chunk formatting function, leading to inconsistent updates and endless loops where no progress is made. To fix this, push the struct xfs_bulkstat_agichunk outwards to be the primary holder of user buffer information. This removes the double handling in the main loop. Also, pass the last inode processed by the chunk formatter as a separate parameter, as it is purely an output variable and is not related to the user buffer consumption cursor. Finally, the chunk formatting code is not shared by anyone, so make it local to xfs_itable.c. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
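The "one primary holder of user-buffer state" design can be sketched as a cursor struct that the chunk formatter updates in place, checking for space only where a record is consumed and returning the last processed inode through a separate output parameter. `stat_rec`, `buf_cursor` and `format_chunk` are hypothetical, and signalling a full buffer with a return value of 1 is a simplification for the sketch.

```c
#include <assert.h>

struct stat_rec { unsigned long long ino; };  /* hypothetical per-inode record */

/* Single holder of user-buffer state, shared by the main loop and the
 * chunk formatter: no local copies that must be written back. */
struct buf_cursor {
    struct stat_rec *ubuffer;  /* next free slot in the user buffer */
    int ubleft;                /* slots remaining */
    int ubused;                /* records written so far */
};

/* Format inodes [first, first + n). Returns 0 when the chunk is done,
 * or 1 if the buffer filled first. *last_ino is a pure output. */
static int format_chunk(struct buf_cursor *c, unsigned long long first,
                        int n, unsigned long long *last_ino)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (c->ubleft <= 0)    /* checked only where space is consumed */
            return 1;
        c->ubuffer->ino = first + (unsigned long long)i;
        c->ubuffer++;
        c->ubleft--;
        c->ubused++;
        *last_ino = first + (unsigned long long)i;
    }
    return 0;
}
```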
2014-11-07 | xfs: bulkstat btree walk doesn't terminate | Dave Chinner
The bulkstat code has several different ways of detecting the end of an AG when doing a walk. They are not consistently detected, and the code that checks for end-of-AG conditions is not consistently coded. Hence there are conditions where the walk code can get stuck in an endless loop, making no progress and not triggering any termination conditions. Convert all the "tmp/i" status return codes from btree operations to a common name (stat) and apply end-of-AG detection to these operations consistently. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-06 | lockd: ratelimit "lockd: cannot monitor" messages | Jeff Layton
When lockd can't talk to a remote statd, it'll spew a warning message to the ring buffer. If the application is really hammering on locks however, it's possible for that message to spam the logs. Ratelimit it to minimize the potential for harm. Reported-by: Ian Collier <imc@cs.ox.ac.uk> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
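Rate limiting of this kind is usually an interval/burst scheme: allow at most `burst` messages per `interval`, and count the rest as suppressed. The kernel has its own ratelimit helpers; the struct and function below are an independent userspace sketch with hypothetical names, using caller-supplied timestamps so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Allow at most `burst` messages per `interval` time units. */
struct ratelimit_sketch {
    long interval, burst;
    long begin;    /* start of the current window */
    long printed;  /* messages allowed in the current window */
    long missed;   /* messages suppressed */
};

static bool ratelimit_ok(struct ratelimit_sketch *rs, long now)
{
    assert(rs->interval > 0);
    if (now - rs->begin >= rs->interval) {  /* window expired: start a new one */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

A caller would wrap the warning as `if (ratelimit_ok(&rs, now)) print(...)`, so a storm of lock requests produces at most `burst` lines per window.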
2014-11-06 | aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer | Gu Zheng
https://bugzilla.kernel.org/show_bug.cgi?id=86831
Markus reported that shutting down mysqld (with AIO support, on an ext3-formatted hard drive) leads to a negative number of dirty pages (an underrun of the counter). The negative number drastically reduces write performance because the page cache is not used: the kernel thinks there are still 2^32 dirty pages outstanding.
Adding a warning to __dec_zone_state catches this easily:
   static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
   {
           atomic_long_dec(&zone->vm_stat[item]);
  +        WARN_ON_ONCE(item == NR_FILE_DIRTY && atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
   }
[ 21.341632] ------------[ cut here ]------------
[ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
[ 21.355296] Modules linked in: wutbox_cp sata_mv
[ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[ 21.366793] Workqueue: events free_ioctx
[ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24)
[ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28)
[ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c)
[ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34)
[ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224)
[ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158)
[ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c)
[ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74)
[ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90)
[ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc)
[ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44)
[ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c)
[ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414)
[ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec)
[ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20)
[ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
The cause is that the aio ring file pages are marked *DIRTY* at init time via SetPageDirty (which bypasses the VFS dirty-page increment), while the aio fs uses *default_backing_dev_info* as its backing dev, which does not disable dirty-page accounting. So truncating the aio ring file decrements the dirty-page count (the VFS dirty-page decrement) for pages that were never counted, and the counter underflows.
The original goal was to keep these pages in memory for their whole lifetime (so they cannot be reclaimed or swapped) by marking them dirty. But on reflection, the pages are already pinned by elevating their refcount, which achieves the same goal, so the SetPageDirty seems unnecessary.
To fix the issue, use __set_page_dirty_no_writeback instead of the nop .set_page_dirty, and drop the SetPageDirty (don't set the dirty flags manually, don't disable set_page_dirty(), rely on the default behaviour).
With the above change, dirty-page accounting works correctly. But as we know, the aio fs is an anonymous one that should never cause any real writeback, so we can skip dirty-page (writeback) accounting entirely by disabling that capability. So we introduce an aio-private backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the default one.
Reported-by: Markus Königshaus <m.koenigshaus@wut.de> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable <stable@vger.kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-11-05 | f2fs: disable roll-forward when active_logs = 2 | Jaegeuk Kim
The roll-forward mechanism should be activated when the number of active logs is not 2. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-05 | fix breakage in o2net_send_tcp_msg() | Al Viro
uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()" by me ;-/ Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
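The bug class is an on-stack `struct msghdr` with fields left uninitialized. A userspace analogue, assuming POSIX sockets: building the header with a designated initializer zeroes every field not named, so `msg_name`, `msg_control` and `msg_flags` never carry stack garbage. `send_buf` is a hypothetical helper; the ocfs2 code itself sends through in-kernel socket APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send a buffer with sendmsg(). Fields not named in the designated
 * initializer (msg_name, msg_control, msg_flags, ...) are zeroed. */
static ssize_t send_buf(int fd, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
    };

    return sendmsg(fd, &msg, 0);
}
```

Declaring `struct msghdr msg;` and assigning only `msg_iov`/`msg_iovlen` would compile just as well, which is why this kind of breakage is easy to introduce.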
2014-11-05 | debugfs: Have debugfs_print_regs32() return void | Joe Perches
The seq_printf() will soon just return void, and seq_has_overflowed() should be used instead to see if the seq can no longer accept input. As the return value of debugfs_print_regs32() has no users and the seq_file descriptor should be checked with seq_has_overflowed() instead of return values of functions, it is better to just have debugfs_print_regs32() also return void. Link: http://lkml.kernel.org/p/2634b19eb1c04a9d31148c1fe6f1f3819be95349.1412031505.git.joe@perches.com Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Joe Perches <joe@perches.com> [ original change only updated seq_printf() return, added return of void to debugfs_print_regs32() as well ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-05 | fs: Convert show_fdinfo functions to void | Joe Perches
Callers of the seq_printf functions shouldn't really check the return value; checking seq_has_overflowed() occasionally should be used instead. Update the vfs documentation. Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com Cc: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> [ did a few clean ups ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-05 | dlm: Use seq_puts() instead of seq_printf() for constant strings | Joe Perches
Convert the seq_printf output with constant strings to seq_puts. Link: http://lkml.kernel.org/p/b416b016f4a6e49115ba736cad6ea2709a8bc1c4.1412031505.git.joe@perches.com Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-05 | dlm: Remove seq_printf() return checks and use seq_has_overflowed() | Joe Perches
The seq_printf() return is going away soon and users of it should check seq_has_overflowed() to see if the buffer is full and will not accept any more data. Convert functions returning int to void where seq_printf() is used. Link: http://lkml.kernel.org/p/43590057bcb83846acbbcc1fe641f792b2fb7773.1412031505.git.joe@perches.com Link: http://lkml.kernel.org/r/20141029220107.939492048@goodmis.org Acked-by: David Teigland <teigland@redhat.com> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
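The calling convention these entries move to can be sketched with a tiny buffer type: the printf-style function returns void and marks the buffer full when output does not fit, and the caller asks a single overflow predicate instead of checking every call. `seq_sketch`, `seq_printf_sketch` and `seq_overflowed_sketch` are hypothetical stand-ins for `struct seq_file`, `seq_printf()` and `seq_has_overflowed()`, not their implementations.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

#define SEQ_SIZE 32

/* Tiny stand-in for a seq_file buffer. */
struct seq_sketch {
    char buf[SEQ_SIZE];
    size_t count;  /* bytes used; count == SEQ_SIZE marks overflow */
};

/* Returns void: callers no longer check each call. */
static void seq_printf_sketch(struct seq_sketch *m, const char *fmt, ...)
{
    if (m->count < SEQ_SIZE) {
        size_t room = SEQ_SIZE - m->count;
        va_list ap;

        va_start(ap, fmt);
        int n = vsnprintf(m->buf + m->count, room, fmt, ap);
        va_end(ap);
        if (n >= 0 && (size_t)n < room) {
            m->count += (size_t)n;
            return;
        }
    }
    m->count = SEQ_SIZE;  /* did not fit: mark the buffer as overflowed */
}

static bool seq_overflowed_sketch(const struct seq_sketch *m)
{
    return m->count == SEQ_SIZE;
}

/* A show function converted from returning int to returning void. */
static void show_regs(struct seq_sketch *m, const unsigned *regs, int n)
{
    for (int i = 0; i < n; i++)
        seq_printf_sketch(m, "r%d = 0x%08x\n", i, regs[i]);
}
```

Each line printed by `show_regs` is 16 bytes, so one register fits in the 32-byte buffer but the second line does not (it would leave no room for the terminator) and trips the overflow mark.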
2014-11-05 | pstore: Honor dmesg_restrict sysctl on dmesg dumps | Sebastian Schmidt
When the kernel.dmesg_restrict restriction is in place, only users with CAP_SYSLOG should be able to access crash dumps (for example: an attacker tries to exploit a bug, the watchdog reboots the machine, and the attacker can then happily read the crash dumps and logs). This applies the restriction to console-* types as well, since sensitive information could have leaked there too. Other log types are unaffected. Signed-off-by: Sebastian Schmidt <yath@yath.de> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
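The access rule can be stated as a small predicate, assuming a simplified model in which only the dmesg and console record types are gated. The enum values and `may_read_record` are hypothetical names, not the pstore API.

```c
#include <assert.h>
#include <stdbool.h>

enum record_type { REC_DMESG, REC_CONSOLE, REC_FTRACE };

/* With dmesg_restrict set, dmesg and console records need CAP_SYSLOG;
 * other record types are unaffected. */
static bool may_read_record(enum record_type type, bool dmesg_restrict,
                            bool has_cap_syslog)
{
    if (!dmesg_restrict)
        return true;
    switch (type) {
    case REC_DMESG:
    case REC_CONSOLE:
        return has_cap_syslog;
    default:
        return true;
    }
}
```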
2014-11-05 | pstore/ram: Strip ramoops header for correct decompression | Ben Zhang
pstore compression/decompression was added during 3.12. The ramoops driver prepends a "====timestamp.timestamp-C|D\n" header to the compressed record before handing it over to pstore driver which doesn't know about the header. In pstore_decompress(), the pstore driver reads the first "==" as a zlib header, so the decompression always fails. For example, this causes the driver to write /dev/pstore/dmesg-ramoops-0.enc.z instead of /dev/pstore/dmesg-ramoops-0. This patch makes the ramoops driver remove the header before pstore decompression. Signed-off-by: Ben Zhang <benzh@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
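Stripping the header amounts to skipping everything up to and including the first newline when a record starts with "====", so the compressed payload (for zlib with default settings, typically starting with bytes 0x78 0x9c) reaches the decompressor intact. `strip_ramoops_header` is a hypothetical helper, not the ramoops code, and assumes the header is always newline-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* If rec starts with a "====<ts>.<ts>-C|D\n" header, return a pointer to
 * the payload after it and store the payload length in *out_len.
 * Records without the "====" prefix are returned unchanged. */
static const char *strip_ramoops_header(const char *rec, size_t len,
                                        size_t *out_len)
{
    if (len >= 4 && memcmp(rec, "====", 4) == 0) {
        const char *nl = memchr(rec, '\n', len);
        if (nl) {
            *out_len = len - (size_t)(nl + 1 - rec);
            return nl + 1;
        }
    }
    *out_len = len;
    return rec;
}
```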