path: root/fs
Age  Commit message  Author
2014-06-30kernfs: introduce kernfs_pin_sb()Li Zefan
kernfs_pin_sb() tries to get a refcnt of the superblock. This will be used by cgroupfs. v2: - make kernfs_pin_sb() return the superblock. - drop kernfs_drop_sb(). tj: Updated the comment a bit. [ This is a prerequisite for a bugfix. ] Cc: <stable@vger.kernel.org> # 3.15 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-29Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds
Pull ext4 bugfixes from Ted Ts'o:
"Fix a regression when trying to compile ext4 on older versions of gcc. Fix a number of miscellaneous bugs for punch hole as well as a long-standing potential double buffer head release when failing a block allocation for an indirect-mapped file"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix hole punching for files with indirect blocks
  ext4: Fix block zeroing when punching holes in indirect block files
  ext4: decrement free clusters/inodes counters when block group declared bad
  fs/mbcache: replace __builtin_log2() with ilog2()
  ext4: Fix buffer double free in ext4_alloc_branch()
2014-06-28btrfs: only unlock block in verify_parent_transid if we locked itJosef Bacik
This is a regression from my patch a26e8c9f75b0bfd8cccc9e8f110737b136eb5994, we need to only unlock the block if we were the one who locked it. Otherwise this will trip BUG_ON()'s in locking.c Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
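The "only unlock if we were the one who locked it" rule can be illustrated with a small Python model. This is a sketch of the locking pattern only, not the btrfs code: `verify_block` and `caller_holds_lock` are hypothetical names, and `threading.Lock` stands in for the extent buffer lock.

```python
import threading


def verify_block(lock: threading.Lock, caller_holds_lock: bool) -> bool:
    """Hypothetical check that takes the lock only when the caller has not."""
    need_lock = not caller_holds_lock
    if need_lock:
        lock.acquire()
    try:
        # Placeholder for the real verification work.
        return True
    finally:
        # Release only what this function acquired itself. Releasing a lock
        # taken by the caller would leave the caller unprotected.
        if need_lock:
            lock.release()


eb_lock = threading.Lock()

# Case 1: the caller already holds the lock; it must still be held afterwards.
eb_lock.acquire()
verify_block(eb_lock, caller_holds_lock=True)
still_held = eb_lock.locked()
eb_lock.release()

# Case 2: the function takes the lock itself and must drop it again.
verify_block(eb_lock, caller_holds_lock=False)
released_after = not eb_lock.locked()
```

Tracking ownership in a local flag keeps the acquire and release paths symmetric.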
2014-06-28Btrfs: assert send doesn't attempt to start transactionsFilipe Manana
When starting a transaction just assert that current->journal_info doesn't contain a send transaction stub, since send isn't supposed to start transactions and when it finishes (either successfully or not) it's supposed to set current->journal_info to NULL. This is motivated by the change titled: Btrfs: fix crash when starting transaction Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs compression: reuse recently used workspaceSergey Senozhatsky
Add compression `workspace' in free_workspace() to `idle_workspace' list head, instead of tail. So we have better chances to reuse most recently used `workspace'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
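The effect of adding to the head instead of the tail can be sketched with a toy pool in Python. This is an illustrative model, not the btrfs implementation: `WorkspacePool` is a hypothetical class, and it assumes that the lookup side takes workspaces from the head of the idle list.

```python
from collections import deque


class WorkspacePool:
    """Toy model of an idle-workspace list."""

    def __init__(self):
        self.idle = deque()

    def free_workspace(self, ws):
        # Head insertion: the workspace that was just used is found first
        # on the next lookup.
        self.idle.appendleft(ws)

    def find_workspace(self):
        # Assumption: lookups take from the head of the idle list.
        return self.idle.popleft() if self.idle else None


pool = WorkspacePool()
for name in ("ws_a", "ws_b", "ws_c"):
    pool.free_workspace(name)

reused = pool.find_workspace()  # the most recently freed workspace
```

With tail insertion the same lookup would return `ws_a`, the workspace that has been idle the longest.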
2014-06-28Btrfs: fix crash when mounting raid5 btrfs with missing disksLiu Bo
The reproducer is
$ mkfs.btrfs D1 D2 D3 -mraid5
$ mkfs.ext4 D2 && mkfs.ext4 D3
$ mount D1 /btrfs -odegraded
-------------------
[ 87.672992] ------------[ cut here ]------------
[ 87.673845] kernel BUG at fs/btrfs/raid56.c:1828!
...
[ 87.673845] RIP: 0010:[<ffffffff813efc7e>] [<ffffffff813efc7e>] __raid_recover_end_io+0x4ae/0x4d0
...
[ 87.673845] Call Trace:
[ 87.673845] [<ffffffff8116bbc6>] ? mempool_free+0x36/0xa0
[ 87.673845] [<ffffffff813f0255>] raid_recover_end_io+0x75/0xa0
[ 87.673845] [<ffffffff81447c5b>] bio_endio+0x5b/0xa0
[ 87.673845] [<ffffffff81447cb2>] bio_endio_nodec+0x12/0x20
[ 87.673845] [<ffffffff81374621>] end_workqueue_fn+0x41/0x50
[ 87.673845] [<ffffffff813ad2aa>] normal_work_helper+0xca/0x2c0
[ 87.673845] [<ffffffff8108ba2b>] process_one_work+0x1eb/0x530
[ 87.673845] [<ffffffff8108b9c9>] ? process_one_work+0x189/0x530
[ 87.673845] [<ffffffff8108c15b>] worker_thread+0x11b/0x4f0
[ 87.673845] [<ffffffff8108c040>] ? rescuer_thread+0x290/0x290
[ 87.673845] [<ffffffff810939c4>] kthread+0xe4/0x100
[ 87.673845] [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220
[ 87.673845] [<ffffffff817e7c7c>] ret_from_fork+0x7c/0xb0
[ 87.673845] [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220
-------------------
This happens because we miscalculate @rbio->bbio->error, so it does not reach the maximum number of tolerable errors when it should.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: create sprout should rename fsid on the sysfs as wellAnand Jain
Creating a sprout changes the fsid of the mounted root. Do the same in sysfs as well.

reproducer:
  mount /dev/sdb /btrfs (seed disk)
  btrfs dev add /dev/sdc /btrfs
  mount -o rw,remount /btrfs
  btrfs dev del /dev/sdb /btrfs
  mount /dev/sdb /btrfs

Error:
kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev replace should replace the sysfs entryAnand Jain
When we replace a device, its corresponding sysfs entry has to be replaced as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev add should add its sysfs entryAnand Jain
We need the device links to be created when a device is added. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev delete should remove sysfs entryAnand Jain
When we delete a device from a mounted btrfs, its corresponding sysfs entry needs to be removed as well. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: rename add_device_membership to btrfs_kobj_add_deviceAnand Jain
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28percpu-refcount: require percpu_ref to be exited explicitlyTejun Heo
Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without going through re-allocation.
* In a similar vein, it's impossible to initialize a percpu_ref count with static percpu variables.
* We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() was considered, but it is easily confused with percpu_ref_kill(), while "exit" clearly indicates that it's the counterpart of percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead, and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
2014-06-28percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()Tejun Heo
ioctx_alloc() reaches inside percpu_ref and directly frees ->pcpu_count in its failure path, which is quite gross. percpu_ref has been providing a proper interface to do this, percpu_ref_cancel_init(), for quite some time now. Let's use that instead. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com>
2014-06-27nfsd: fix rare symlink decoding bugJ. Bruce Fields
An NFS operation that creates a new symlink includes the symlink data, which is xdr-encoded as a length followed by the data plus 0 to 3 bytes of zero-padding as required to reach a 4-byte boundary. The vfs, on the other hand, wants null-terminated data. The simple way to handle this would be by copying the data into a newly allocated buffer with space for the final null. The current nfsd_symlink code tries to be more clever by skipping that step in the (likely) case where the byte following the string is already 0. But that assumes that the byte following the string is ours to look at. In fact, it might be the first byte of a page that we can't read, or of some object that another task might modify. Worse, the NFSv4 code tries to fix the problem by actually writing to that byte. In the NFSv2/v3 cases this actually appears to be safe: - nfs3svc_decode_symlinkargs explicitly null-terminates the data (after first checking its length and copying it to a new page). - NFSv2 limits symlinks to 1k. The buffer holding the rpc request is always at least a page, and the link data (and previous fields) have maximum lengths that prevent the request from reaching the end of a page. In the NFSv4 case the CREATE op is potentially just one part of a long compound so can end up on the end of a page if you're unlucky. The minimal fix here is to copy and null-terminate in the NFSv4 case. The nfsd_symlink() interface here seems too fragile, though. It should really either do the copy itself every time or just require a null-terminated string. Reported-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
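The XDR layout described above (a 4-byte length, the data, then 0 to 3 zero bytes up to a 4-byte boundary) and the "copy and null-terminate" approach can be sketched in Python. This is a simplified model, not the nfsd decoder; the function names are hypothetical.

```python
def xdr_pad_len(n: int) -> int:
    """Number of zero bytes needed to pad n bytes to a 4-byte boundary."""
    return (4 - n % 4) % 4


def xdr_encode_opaque(data: bytes) -> bytes:
    """Encode data as a big-endian length, the data, then zero padding."""
    return len(data).to_bytes(4, "big") + data + b"\x00" * xdr_pad_len(len(data))


def decode_symlink_target(buf: bytes, offset: int = 0) -> bytes:
    """Copy the link data into a new buffer and append the final null.

    The byte following the string in buf is never inspected or written.
    """
    if offset + 4 > len(buf):
        raise ValueError("truncated length field")
    length = int.from_bytes(buf[offset:offset + 4], "big")
    start = offset + 4
    if start + length > len(buf):
        raise ValueError("truncated symlink data")
    return buf[start:start + length] + b"\x00"


# A 4-byte target needs no padding, so no byte follows the string at all.
encoded = xdr_encode_opaque(b"abcd")
target = decode_symlink_target(encoded)
```

When the length is already a multiple of four, the byte after the string belongs to whatever comes next in the request (or to nothing), which is why peeking at it is unsafe.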
2014-06-26ext4: Fix hole punching for files with indirect blocksJan Kara
Hole punching code for files with indirect blocks wrongly computed the number of blocks which need to be cleared when traversing the indirect block tree. That could result in punching more blocks than actually requested and thus effectively cause data loss. For example: fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 10240000 - 10244096. Fix the calculation. CC: stable@vger.kernel.org Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
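The arithmetic behind the example can be checked with a few lines of Python. The offset, length and faulty end offset come from the message above; the 4096-byte block size is an assumption made only to express the damage in blocks.

```python
BLOCK_SIZE = 4096  # assumption: filesystem block size, not stated in the message


def punch_range(offset: int, length: int) -> tuple:
    """Byte range [start, end) that a punch request should cover."""
    return offset, offset + length


def block_span(offset: int, length: int, block_size: int = BLOCK_SIZE) -> tuple:
    """First and last block index touched by the byte range."""
    return offset // block_size, (offset + length - 1) // block_size


requested = punch_range(10240000, 4096)
buggy_end = 12632064  # end offset reported for the faulty calculation
extra_bytes = buggy_end - requested[1]
extra_blocks = extra_bytes // BLOCK_SIZE
blocks = block_span(10240000, 4096)
```

Under that block-size assumption the request covers exactly one block, while the faulty range extends several hundred blocks past it.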
2014-06-26ext4: Fix block zeroing when punching holes in indirect block filesJan Kara
free_holes_block() passed local variable as a block pointer to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local variable instead of proper place in inode / indirect block. We later zero out proper place in inode / indirect block but don't dirty the inode / buffer again which can lead to subtle issues (some changes e.g. to inode can be lost). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26ext4: decrement free clusters/inodes counters when block group declared badNamjae Jeon
We should decrement the free clusters counter when the block bitmap is marked as corrupt, and the free inodes counter when the allocation bitmap is marked as corrupt, to avoid misunderstanding due to an incorrect available size in the statfs result. The user then gets an ENOSPC error immediately from write begin, without reaching writepages. Cc: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-25Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French:
"Small set of misc cifs/smb3 fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
  cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
  fs/cifs: fix regression in cifs_create_mf_symlink()
2014-06-25Merge tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
 - Stable fix for a data corruption case due to incorrect cache validation
 - Fix a couple of false positive cache invalidations
 - Fix NFSv4 security negotiation issues"

* tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
  NFS Return -EPERM if no supported or matching SECINFO flavor
  NFS check the return of nfs4_negotiate_security in nfs4_submount
  NFS: Don't mark the data cache as invalid if it has been flushed
  NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
  nfs: Fix cache_validity check in nfs_write_pageuptodate()
2014-06-25fs/mbcache: replace __builtin_log2() with ilog2()T Makphaibulchoke
Fix compiler error with some gcc version(s) that do not support __builtin_log2() by replacing __builtin_log2() with ilog2(). Signed-off-by: T. Makphaibulchoke <tmac@hp.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
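For readers unfamiliar with ilog2(): it yields the integer base-2 logarithm, rounded down. A Python equivalent, offered as an illustration rather than the kernel macro itself, uses the bit length of the value.

```python
def ilog2(n: int) -> int:
    """Integer base-2 logarithm, rounded down, for positive n."""
    if n <= 0:
        raise ValueError("ilog2 is only defined for positive integers")
    # For n >= 1 the position of the highest set bit is bit_length() - 1.
    return n.bit_length() - 1


bucket_bits = ilog2(4096)
```

Because only integer operations are involved, no floating-point builtin or math library support is needed.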
2014-06-25FS/NFS: replace count*size kzalloc by kcallocFabian Frederick
kcalloc manages count*sizeof overflow. Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: linux-nfs@vger.kernel.org Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
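Why the overflow check matters can be modelled in Python by emulating fixed-width size arithmetic. This is a sketch under the assumption of a 64-bit size_t; `kzalloc_request` and `kcalloc_request` are hypothetical helpers that only compute the size that would be requested, and `None` stands in for a failed allocation.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def kzalloc_request(count: int, size: int) -> int:
    """Size requested by an unchecked count * size, wrapping like size_t."""
    return (count * size) & SIZE_MAX


def kcalloc_request(count: int, size: int):
    """Size requested with an overflow check; None models a failed allocation."""
    if size != 0 and count > SIZE_MAX // size:
        return None
    return count * size


count, size = 2**61 + 1, 8
wrapped = kzalloc_request(count, size)   # far smaller than intended
checked = kcalloc_request(count, size)   # overflow detected
```

The unchecked product silently wraps to a tiny allocation, which later writes would overrun; the checked variant refuses the request instead.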
2014-06-25nfs: get rid of duplicate dprintkWeston Andros Adamson
This was introduced by a merge error with my recent pgio patchset. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-25xfs: global error sign conversionDave Chinner
Convert all the errors in the core XFS code to negative error signs like the rest of the kernel, and remove all the sign conversion we do in the interface layers.

Errors for conversion (and comparison) found via searches like:

$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs

Negation points found via searches like:

$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs

[ with some bits I missed from Brian Foster ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
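The before and after conventions can be contrasted in a short Python sketch. The function names are hypothetical and the model is deliberately tiny: it only shows where the negation used to happen and where it no longer does.

```python
import errno


def core_lookup_old(found: bool) -> int:
    """Old convention (modelled): core code returns positive errno values."""
    return 0 if found else errno.ENOENT


def interface_old(found: bool) -> int:
    """Old interface layer: flips the sign for the rest of the kernel."""
    return -core_lookup_old(found)


def core_lookup_new(found: bool) -> int:
    """New convention (modelled): core code returns negative errno values."""
    return 0 if found else -errno.ENOENT


def interface_new(found: bool) -> int:
    """New interface layer: passes the error through unchanged."""
    return core_lookup_new(found)


old_result = interface_old(False)
new_result = interface_new(False)
```

Callers above the interface layer see the same value either way; the change removes the boundary where a forgotten negation could produce a positive error code.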
2014-06-25libxfs: move source filesDave Chinner
Move all the source files that are shared with userspace into libxfs/. This is done as one big chunk simply to get it done quickly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25libxfs: move header filesDave Chinner
Move all the header files that are shared with userspace into libxfs. This is done as one big chunk simply to get it done quickly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25xfs: create libxfs infrastructureDave Chinner
To minimise the differences between kernel and userspace code, split the kernel code into the same structure as the userspace code. That is, the generic core functionality of XFS is moved to a libxfs/ directory, which is treated as a layering barrier in the XFS code. This patch introduces the libxfs directory, the build infrastructure and an initial source and header file to build. The libxfs directory will contain the header files that are needed to build libxfs - most of userspace does not care about the location of these header files as they are accessed indirectly. Hence keeping them inside libxfs makes it easy to track the changes and script the sync process as the directory structure will be identical. To allow this changeover to occur in the kernel code, there is some temporary infrastructure in the makefiles to grab the header files from both locations. Once all the files are moved, modifications will be made in the source code that will make the need for these include directives go away. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-24nfs: Fix unused variable errorAnna Schumaker
inode is unused when CONFIG_SUNRPC_DEBUG=n. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: remove unneeded EXPORTsWeston Andros Adamson
EXPORT_GPLs of nfs_pageio_add_request and nfs_pageio_complete aren't needed anymore. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24pnfs: clean up *_resend_to_mdsWeston Andros Adamson
Clean up pnfs_read_done_resend_to_mds and pnfs_write_done_resend_to_mds: - instead of passing all arguments from a nfs_pgio_header, just pass the header - share the common code Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: remove pgio_header refcount, related cleanupWeston Andros Adamson
The refcounting on nfs_pgio_header was related to there being (possibly) more than one nfs_pgio_data. Now that nfs_pgio_data has been merged into nfs_pgio_header, there is no reason to do this ref counting. Just call the completion callback on nfs_pgio_release/nfs_pgio_error. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: remove unused writeverf codeWeston Andros Adamson
Remove duplicate writeverf structure from merge of nfs_pgio_header and nfs_pgio_data and remove writeverf related flags and logic to handle more than one RPC per nfs_pgio_header. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: merge nfs_pgio_data into _headerWeston Andros Adamson
struct nfs_pgio_data only exists as a member of nfs_pgio_header, but is passed around everywhere, because there used to be multiple _data structs per _header. Many of these functions then use the _data to find a pointer to the _header. This patch cleans this up by merging the nfs_pgio_data structure into nfs_pgio_header and passing nfs_pgio_header around instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: rename members of nfs_pgio_dataWeston Andros Adamson
Rename "verf" to "writeverf" and "pages" to "page_array" to prepare for merge of nfs_pgio_data and nfs_pgio_header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: move nfs_pgio_data and remove nfs_rw_headerWeston Andros Adamson
nfs_rw_header was used to allocate an nfs_pgio_header along with an nfs_pgio_data, because a _header would need at least one _data. Now there is only ever one nfs_pgio_data for each nfs_pgio_header -- move it to nfs_pgio_header and get rid of nfs_rw_header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for supportAndy Adamson
Fix nfs4_negotiate_security to create an rpc_clnt used to test each SECINFO returned pseudoflavor. Check credential creation (and gss_context creation) which is important for RPC_AUTH_GSS pseudoflavors which can fail for multiple reasons including mis-configuration. Don't call nfs4_negotiate in nfs4_submount as it was just called by nfs4_proc_lookup_mountpoint (nfs4_proc_lookup_common) Signed-off-by: Andy Adamson <andros@netapp.com> [Trond: fix corrupt return value from nfs_find_best_sec()] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24NFS Return -EPERM if no supported or matching SECINFO flavorAndy Adamson
Do not return RPC_AUTH_UNIX if SECINFO reply tests fail. This prevents an infinite loop of NFS4ERR_WRONGSEC for non RPC_AUTH_UNIX mounts. Without this patch, a mount with no sec= option to a server that does not include RPC_AUTH_UNIX in the SECINFO return can be presented with an attempt to use RPC_AUTH_UNIX, which will result in an NFS4ERR_WRONGSEC, which will prompt the SECINFO call, which will again try RPC_AUTH_UNIX.... Signed-off-by: Andy Adamson <andros@netapp.com> Tested-By: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
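The retry loop can be modelled with a toy negotiation in Python. This is a loose sketch and not the NFS client code: `pick_flavor` and `mount` are hypothetical, flavors are plain strings, and a capped round counter stands in for the otherwise endless loop.

```python
import errno


def pick_flavor(secinfo_flavors, usable, fallback_to_unix):
    """Return the first server flavor the client can use.

    With fallback_to_unix=True (modelled old behaviour) AUTH_UNIX is returned
    when nothing matches; otherwise -EPERM is returned.
    """
    for flavor in secinfo_flavors:
        if flavor in usable:
            return flavor
    return "AUTH_UNIX" if fallback_to_unix else -errno.EPERM


def mount(secinfo_flavors, usable, fallback_to_unix, max_rounds=5):
    """Return (status, rounds); status None means the round cap was hit."""
    for rounds in range(1, max_rounds + 1):
        choice = pick_flavor(secinfo_flavors, usable, fallback_to_unix)
        if isinstance(choice, int):
            return choice, rounds      # negative errno: give up cleanly
        if choice in secinfo_flavors:
            return 0, rounds           # server accepts the flavor
        # Server answers NFS4ERR_WRONGSEC; the client asks SECINFO again.
    return None, max_rounds


old_behaviour = mount(["krb5"], usable=set(), fallback_to_unix=True)
new_behaviour = mount(["krb5"], usable=set(), fallback_to_unix=False)
```

In the model, the fallback keeps proposing a flavor the server has already excluded, so only an explicit error ends the negotiation.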
2014-06-24NFS check the return of nfs4_negotiate_security in nfs4_submountAndy Adamson
Signed-off-by: Andy Adamson <andros@netapp.com> Tested-By: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24NFS: Don't mark the data cache as invalid if it has been flushedTrond Myklebust
Now that we have functions such as nfs_write_pageuptodate() that use the cache_validity flags to check if the data cache is valid or not, it is a little more important to keep the flags in sync with the state of the data cache. In particular, we'd like to ensure that if the data cache is empty, we don't start marking it as needing revalidation. Reported-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file sizeTrond Myklebust
In nfs_update_inode(), if the change attribute is seen to change on the server, then we set NFS_INO_REVAL_PAGECACHE in order to make sure that we check the file size. However, if we also update the file size in the same function, we don't need to check it again. So make sure that we clear the NFS_INO_REVAL_PAGECACHE that was set earlier. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: Fix cache_validity check in nfs_write_pageuptodate()Scott Mayhew
NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation.

We're still having some problems with data corruption when multiple clients are appending to a file and those clients are being granted write delegations on open.

To reproduce:

Client A:
vi /mnt/`hostname -s`
while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

Client B:
vi /mnt/`hostname -s`
while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

What's happening is that in nfs_update_inode() we're recognizing that the file size has changed and we're setting NFS_INO_INVALID_DATA accordingly, but then we ignore the cache_validity flags in nfs_write_pageuptodate() because we have a delegation. As a result, in nfs_updatepage() we're extending the write to cover the full page even though we've not read in the data to begin with.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()Oleg Nesterov
ioctx_add_table() is the writer; it does not need rcu_read_lock() to protect ->ioctx_table. It relies on mm->ioctx_lock, and the rcu locks just add confusion. It doesn't need rcu_dereference() for the same reason: it must see any updates previously done under the same ->ioctx_lock. We could use rcu_dereference_protected(), but the patch uses rcu_dereference_raw() since the function is simple enough. The same goes for kill_ioctx(), although it does not update the pointer. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-06-24aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()Oleg Nesterov
On 04/30, Benjamin LaHaise wrote:
>
> > - ctx->mmap_size = 0;
> > -
> > - kill_ioctx(mm, ctx, NULL);
> > + if (ctx) {
> > + ctx->mmap_size = 0;
> > + kill_ioctx(mm, ctx, NULL);
> > + }
>
> Rather than indenting and moving the two lines changing mmap_size and the
> kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces
> the number of lines changed and avoid excessive indentation.

OK. To me the code looks better/simpler with "if (ctx)", but this is subjective of course, I won't argue.

The patch still removes the empty line between mmap_size = 0 and kill_ioctx(), we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

-------------------------------------------------------------------------------
Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

1. We can read ->ioctx_table only once and we do not need rcu_read_lock() or even rcu_dereference().

   This mm has no users, nobody else can play with ->ioctx_table. Otherwise the code is buggy anyway, if we need rcu_read_lock() in a loop because ->ioctx_table can be updated then kfree(table) is obviously wrong.

2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid munmap(), but another reason is that we simply can't do vm_munmap() unless current->mm == mm and this is not true in general, the caller is mmput().

3. We do not really need to nullify mm->ioctx_table before return, probably the current code does this to catch the potential problems. But in this case RCU_INIT_POINTER(NULL) looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-06-24aio: fix kernel memory disclosure in io_getevents() introduced in v3.10Benjamin LaHaise
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10 by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to aio_read_events_ring() failed to correctly limit the index into ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of an arbitrary page with a copy_to_user() to copy the contents into userspace. This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and Petr for disclosing this issue. This patch applies to v3.12+. A separate backport is needed for 3.10/3.11. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: stable@vger.kernel.org
2014-06-24aio: fix aio request leak when events are reaped by userspaceBenjamin LaHaise
The aio cleanups and optimizations by kmo that were merged into the 3.10 tree added a regression for userspace event reaping. Specifically, the reference counts are not decremented if the event is reaped in userspace, leading to the application being unable to submit further aio requests. This patch applies to 3.12+. A separate backport is required for 3.10/3.11. This issue was uncovered as part of CVE-2014-0206. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com>
2014-06-24[CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars optionSteve French
When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ? via the Unicode Windows to POSIX remap range), empty paths (e.g. when we open "" to query the root of the SMB3 directory on mount) were not null terminated, so we sent garbage as a path name on empty paths, which caused SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified. mapchars is particularly important since Unix Extensions for SMB3 are not supported (yet). Signed-off-by: Steve French <smfrench@gmail.com> Cc: <stable@vger.kernel.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
2014-06-23ocfs2/dlm: do not purge lockres that is queued for assert masterXue jiufei
When the workqueue is delayed, a lockres may be purged while it is still queued for master assert. It may trigger BUG() as follows.

N1                                  N2
dlm_get_lockres()
->dlm_do_master_requery
                                    is the master of lockres,
                                    so queue assert_master work

                                    dlm_thread() start running
                                    and purge the lockres

                                    dlm_assert_master_worker()
                                    send assert master message
                                    to other nodes
receiving the assert_master
message, set master to N2

dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID, if it is RECOVERY lockres, it triggers the BUG().

Another BUG() is triggered when N3 becomes the new master and sends assert_master to N1; N1 will trigger the BUG() because the owner doesn't match. So we should not purge a lockres when it is queued for assert master.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless loop during umountjiangyiwen
The following case may lead to an endless loop during umount.

node A node B node C node D umount volume, migrate lockres1 to B want to lock lockres1, send MASTER_REQUEST_MSG to C init block mle send MIGRATE_REQUEST_MSG to C find a block mle, and then return DLM_MIGRATE_RESPONSE_MASTERY_REF to B set C in refmap umount successfully try to umount, endless loop occurs when migrate lockres1 since C is in refmap

So we can fix this endless loop case by only returning DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving MIGRATE_REQUEST_MSG.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknodjiangyiwen
When the call to ocfs2_add_entry() fails in ocfs2_symlink() or ocfs2_mknod(), iput() will not be called during dput(dentry) because d_instantiate() has not been called, and this leads to a hang on umount. Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23ocfs2: fix a tiny race when running dirop_fileop_racerYiwen Jiang
When running dirop_fileop_racer we found a deadlock case.

2 nodes, say Node A and Node B, mount the same ocfs2 volume. Create /race/16/1 in the filesystem, and let the inode number of dir 16 be less than the inode number of dir race.

Node A                          Node B
mv /race/16/1 /race/
                                right after Node A has got the
                                EX mode of /race/16/, and tries to
                                get EX mode of /race
                                ls /race/16/

In this case, Node A has got the EX mode of /race/16/, and wants to get EX mode of /race/. Node B has got the PR mode of /race/, and wants to get the PR mode of /race/16/. Since EX and PR are mutually exclusive, deadlock happens.

This patch fixes this case by locking in ancestor order before trying inode number order.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
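The ordering rule named in the fix, ancestor first and inode number only as a tie-breaker for unrelated directories, can be sketched in Python. This is an illustrative model, not the ocfs2 code: directories are (path, inode number) pairs, the function names and the inode numbers are made up, and "lower inode number first" is assumed for the fallback.

```python
from pathlib import PurePosixPath


def is_ancestor(a: str, b: str) -> bool:
    """True if directory a is a proper ancestor of directory b."""
    return PurePosixPath(a) in PurePosixPath(b).parents


def rename_lock_order(dir1, dir2):
    """Return the paths of two directories in the order they should be locked."""
    (path1, ino1), (path2, ino2) = dir1, dir2
    if is_ancestor(path1, path2):
        return [path1, path2]
    if is_ancestor(path2, path1):
        return [path2, path1]
    # Unrelated directories: fall back to inode number order (assumed ascending).
    return [path1, path2] if ino1 <= ino2 else [path2, path1]


# /race/16 has the smaller inode number, as in the scenario above.
order = rename_lock_order(("/race/16", 5), ("/race", 9))
```

Locking `/race` before `/race/16` matches the parent-then-child order a lookup such as `ls /race/16/` takes, so both nodes acquire the two locks in the same sequence.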
2014-06-23ocfs2/dlm: fix misuse of list_move_tail() in dlm_run_purge_list()Xue jiufei
When a lockres is in the purge list but still in use, it should be moved to the tail of the purge list, so that dlm_thread continues with the next lockres in the list. However, the code list_move_tail(&dlm->purge_list, &lockres->purge) does *no* movement, so dlm_thread will process the same lockres in this loop again and again. If it is in use for a long time, other lockres will not be processed. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
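Why the swapped arguments leave the list unchanged can be seen with a small Python model of a circular doubly linked list. The helpers mirror the shape of the kernel's list primitives (entry first, head second) but are a simplified re-implementation for illustration; the node names are made up.

```python
class ListHead:
    """Node of a circular doubly linked list; an empty list points to itself."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def list_del(entry):
    # Unlink entry; its own pointers are left untouched.
    entry.prev.next = entry.next
    entry.next.prev = entry.prev


def list_add_tail(new, head):
    # Insert new just before head, i.e. at the tail of head's list.
    prev = head.prev
    prev.next = new
    new.prev = prev
    new.next = head
    head.prev = new


def list_move_tail(entry, head):
    list_del(entry)
    list_add_tail(entry, head)


def walk(head):
    names, node = [], head.next
    while node is not head:
        names.append(node.name)
        node = node.next
    return names


def build():
    head = ListHead("purge_list")
    nodes = [ListHead(n) for n in ("A", "B", "C")]
    for node in nodes:
        list_add_tail(node, head)
    return head, nodes


head, (a, b, c) = build()
list_move_tail(head, a)      # swapped arguments: the head is "moved" before A
buggy_order = walk(head)

head, (a, b, c) = build()
list_move_tail(a, head)      # intended call: A goes to the tail
fixed_order = walk(head)
```

With the arguments swapped, the head is unlinked and re-inserted directly in front of the first entry, which is where it already was, so the next pass picks the same lockres again.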