summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2012-11-14userns: Support autofs4 interacing with multiple user namespacesEric W. Biederman
Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue. When creating directories and symlinks default the uid and gid of the mount requester to the global root uid and gid. autofs4_wait will update these fields when a mount is requested. When generating autofsv5 packets report the uid and gid of the mount requestor in user namespace of the process that opened the pipe, reporting unmapped uids and gids as overflowuid and overflowgid. In autofs_dev_ioctl_requester return the uid and gid of the last mount requester converted into the calling processes user namespace. When the uid or gid don't map return overflowuid and overflowgid as appropriate, allowing failure to find a mount requester to be distinguished from failure to map a mount requester. The uid and gid mount options specifying the user and group of the root autofs inode are converted into kuid and kgid as they are parsed defaulting to the current uid and current gid of the process that mounts autofs. Mounting of autofs for the present remains confined to processes in the initial user namespace. Cc: Ian Kent <raven@themaw.net> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14ext4: init pagevec in ext4_da_block_invalidatepagesEric Sandeen
ext4_da_block_invalidatepages is missing a pagevec_init(), which means that pvec->cold contains random garbage. This affects whether the page goes to the front or back of the LRU when ->cold makes it to free_hot_cold_page() Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-11-14pstore: Fix NULL pointer dereference in console writesColin Ian King
Passing a NULL id causes a NULL pointer deference in writers such as erst_writer and efi_pstore_write because they expect to update this id. Pass a dummy id instead. This avoids a cascade of oopses caused when the initial pstore_console_write passes a null which in turn causes writes to the console causing further oopses in subsequent pstore_console_write calls. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
2012-11-14xfs: remove xfs_flushinval_pagesDave Chinner
It's just a simple wrapper around VFS functionality, and is actually bugging in that it doesn't remove mappings before invalidating the page cache. Remove it and replace it with the correct VFS functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14xfs: remove xfs_flush_pagesDave Chinner
It is a complex wrapper around VFS functions, but there are VFS functions that provide exactly the same functionality. Call the VFS functions directly and remove the unnecessary indirection and complexity. We don't need to care about clearing the XFS_ITRUNCATED flag, as that is done during .writepages. Hence is cleared by the VFS writeback path if there is anything to write back during the flush. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14xfs: remove xfs_wait_on_pages()Dave Chinner
It's just a simple wrapper around a VFS function that is only called by another function in xfs_fs_subr.c. Remove it and call the VFS function directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14xfs: reverse the check on XFS_IOC_ZERO_RANGEAndrew Dahl
Reversing the check on XFS_IOC_ZERO_RANGE. Range should be zeroed if the start is less than or equal to the end. Signed-off-by: Andrew Dahl <adahl@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14xfs: remove xfs_tosspagesDave Chinner
It's a buggy, unnecessary wrapper that is duplicating truncate_pagecache_range(). When replacing the call in xfs_change_file_space(), also ensure that the length being allocated/freed is always positive before making any changes. These checks are done in the lower extent manipulation functions, too, but we need to do them before any page cache operations. Reported-by: Andrew Dahl <adahl@sgi.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-By: Andrew Dahl <adahl@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-14Merge v3.7-rc5 into tty-nextGreg Kroah-Hartman
This pulls in the 3.7-rc5 fixes into tty-next to make it easier to test.
2012-11-14nfsd4: get_backchannel_cred should be staticFengguang Wu
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-14nfsd4: init_session should be declared staticFengguang Wu
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-14GFS2: skip dlm_unlock calls in unmountDavid Teigland
When unmounting, gfs2 does a full dlm_unlock operation on every cached lock. This can create a very large amount of work and can take a long time to complete. However, the vast majority of these dlm unlock operations are unnecessary because after all the unlocks are done, gfs2 leaves the dlm lockspace, which automatically clears the locks of the leaving node, without unlocking each one individually. So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to remove the locks implicitly. The one exception is when the lock's lvb is being used. In this case, dlm_unlock is called because it may update the lvb of the resource. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-13xfs: make growfs initialise the AGFL headerDave Chinner
For verification purposes, AGFLs need to be initialised to a known set of values. For upcoming CRC changes, they are also headers that need to be initialised. Currently, growfs does neither for the AGFLs - it ignores them completely. Add initialisation of the AGFL to be full of invalid block numbers (NULLAGBLOCK) to put the infrastructure in place needed for CRC support. Includes a comment clarification from Jeff Liu. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: growfs: use uncached buffers for new headersDave Chinner
When writing the new AG headers to disk, we can't attach write verifiers because they have a dependency on the struct xfs-perag being attached to the buffer to be fully initialised and growfs can't fully initialise them until later in the process. The simplest way to avoid this problem is to use uncached buffers for writing the new headers. These buffers don't have the xfs-perag attached to them, so it's simple to detect in the write verifier and be able to skip the checks that need the xfs-perag. This enables us to attach the appropriate buffer ops to the buffer and hence calculate CRCs on the way to disk. IT also means that the buffer is torn down immediately, and so the first access to the AG headers will re-read the header from disk and perform full verification of the buffer. This way we also can catch corruptions due to problems that went undetected in growfs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: use btree block initialisation functions in growfsDave Chinner
Factor xfs_btree_init_block() to be independent of the btree cursor, and use the function to initialise btree blocks in the growfs code. This makes adding support for different format btree blocks simple. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: add more attribute tree trace points.Dave Chinner
Added when debugging recent attribute tree problems to more finely trace code execution through the maze of twisty passages that makes up the attr code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: drop buffer io reference when a bad bio is builtDave Chinner
Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: fix broken error handling in xfs_vm_writepageDave Chinner
When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13xfs: fix attr tree double split corruptionDave Chinner
In certain circumstances, a double split of an attribute tree is needed to insert or replace an attribute. In rare situations, this can go wrong, leaving the attribute tree corrupted. In this case, the attr being replaced is the last attr in a leaf node, and the replacement is larger so doesn't fit in the same leaf node. When we have the initial condition of a node format attribute btree with two leaves at index 1 and 2. Call them L1 and L2. The leaf L1 is completely full, there is not a single byte of free space in it. L2 is mostly empty. The attribute being replaced - call it X - is the last attribute in L1. The way an attribute replace is executed is that the replacement attribute - call it Y - is first inserted into the tree, but has an INCOMPLETE flag set on it so that list traversals ignore it. Once this transaction is committed, a second transaction it run to atomically mark Y as COMPLETE and X as INCOMPLETE, so that a traversal will now find Y and skip X. Once that transaction is committed, attribute X is then removed. So, the initial condition is: +--------+ +--------+ | L1 | | L2 | | fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 | | fsp: 0 | | fsp: N | |--------| |--------| | attr A | | attr 1 | |--------| |--------| | attr B | | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr X | | attr n | +--------+ +--------+ So now we go to replace X, and see that L1:fsp = 0 - it is full so we can't insert Y in the same leaf. So we record the the location of attribute X so we can track it for later use, then we split L1 into L1 and L3 and reblance across the two leafs. We end with: +--------+ +--------+ +--------+ | L1 | | L3 | | L2 | | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: N | |--------| |--------| |--------| | attr A | | attr X | | attr 1 | |--------| +--------+ |--------| | attr B | | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ And we track that the original attribute is now at L3:0. We then try to insert Y into L1 again, and find that there isn't enough room because the new attribute is larger than the old one. Hence we have to split again to make room for Y. We end up with this: +--------+ +--------+ +--------+ +--------+ | L1 | | L4 | | L3 | | L2 | | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: J | | fsp: N | |--------| |--------| |--------| |--------| | attr A | | attr Y | | attr X | | attr 1 | |--------| + INCOMP + +--------+ |--------| | attr B | +--------+ | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ And now we have the new (incomplete) attribute @ L4:0, and the original attribute at L3:0. At this point, the first transaction is committed, and we move to the flipping of the flags. This is where we are supposed to end up with this: +--------+ +--------+ +--------+ +--------+ | L1 | | L4 | | L3 | | L2 | | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: J | | fsp: N | |--------| |--------| |--------| |--------| | attr A | | attr Y | | attr X | | attr 1 | |--------| +--------+ + INCOMP + |--------| | attr B | +--------+ | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ But that doesn't happen properly - the attribute tracking indexes are not pointing to the right locations. What we end up with is both the old attribute to be removed pointing at L4:0 and the new attribute at L4:1. On a debug kernel, this assert fails like so: XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725 because the new attribute location does not exist. On a production kernel, this goes unnoticed and the code proceeds ahead merrily and removes L4 because it thinks that is the block that is no longer needed. This leaves the hash index node pointing to entries L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free space, and so everything is busted. This corruption is caused by the removal of the old attribute triggering a join - it joins everything correctly but then frees the wrong block. xfs_repair will report something like: bad sibling back pointer for block 4 in attribute fork for inode 131 problem with attribute contents in inode 131 would clear attr fork bad nblocks 8 for inode 131, would reset to 3 bad anextents 4 for inode 131, would reset to 0 The problem lies in the assignment of the old/new blocks for tracking purposes when the double leaf split occurs. The first split tries to place the new attribute inside the current leaf (i.e. "inleaf == true") and moves the old attribute (X) to the new block. This sets up the old block/index to L1:X, and newly allocated block to L3:0. It then moves attr X to the new block and tries to insert attr Y at the old index. That fails, so it splits again. With the second split, the rebalance ends up placing the new attr in the second new block - L4:0 - and this is where the code goes wrong. What is does is it sets both the new and old block index to the second new block. Hence it inserts attr Y at the right place (L4:0) but overwrites the current location of the attr to replace that is held in the new block index (currently L3:0). It over writes it with L4:1 - the index we later assert fail on. Hopefully this table will show this in a foramt that is a bit easier to understand: Split old attr index new attr index vanilla patched vanilla patched before 1st L1:26 L1:26 N/A N/A after 1st L3:0 L3:0 L1:26 L1:26 after 2nd L4:0 L3:0 L4:1 L4:0 ^^^^ ^^^^ wrong wrong The fix is surprisingly simple, for all this analysis - just stop the rebalance on the out-of leaf case from overwriting the new attr index - it's already correct for the double split case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-13GFS2: Fix one RG corner caseSteven Whitehouse
For filesystems with only a single resource group, we need to be careful that the allocation loop will not land up with a NULL resource group. This fixes a bug in a previous patch where the gfs2_rgrpd_get_next() function was being used instead of gfs2_rgrpd_get_first() Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-13GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inodeBob Peterson
Since we now have a dirty_inode that takes care of manipulating the inode buffer and writing from the inode to the buffer, we can eliminate some unnecessary buffer manipulations in gfs2_unlink_inode that are now redundant. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-13GFS2: Use dirty_inode in gfs2_dir_addBob Peterson
This patch changes the gfs2_dir_add function so that it uses the dirty_inode function (via mark_inode_dirty) rather than manually updating the dinode. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-13GFS2: Fix truncation of journaled data filesSteven Whitehouse
This patch fixes an issue relating to not having enough revokes available when truncating journaled data files. In order to ensure that we do no run out, the truncation is broken into separate pieces if it is large enough. Tested using fsx on a journaled data file. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-11-12ext4: don't verify checksums of dx non-leaf nodes during fallback scanDarrick J. Wong
During a directory entry lookup of a hashed directory, if the hash-based lookup functions fail and we fall back to a linear scan, don't try to verify the dirent checksum on the internal nodes of the hash tree because they don't store a checksum in a hidden dirent like the leaf nodes do. Reported-by: George Spelvin <linux@horizon.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-12nfsd: release the legacy reclaimable clients list in grace_doneJeff Layton
The current code holds on to this list until nfsd is shut down, but it's never touched once the grace period ends. Release that memory back into the wild when the grace period ends. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: get rid of cl_recdir fieldJeff Layton
Remove the cl_recdir field from the nfs4_client struct. Instead, just compute it on the fly when and if it's needed, which is now only when the legacy client tracking code is in effect. The error handling in the legacy client tracker is also changed to handle the case where md5 is unavailable. In that case, we'll warn the admin with a KERN_ERR message and disable the client tracking. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: move the confirmed and unconfirmed hlists to a rbtreeJeff Layton
The current code requires that we md5 hash the name in order to store the client in the confirmed and unconfirmed trees. Change it instead to store the clients in a pair of rbtrees, and simply compare the cl_names directly instead of hashing them. This also necessitates that we add a new flag to the clp->cl_flags field to indicate which tree the client is currently in. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: don't search for client by hash on legacy reboot recovery gracedoneJeff Layton
When nfsd starts, the legacy reboot recovery code creates a tracking struct for each directory in the v4recoverydir. When the grace period ends, it basically does a "readdir" on the directory again, and matches each dentry in there to an existing client id to see if it should be removed or not. If the matching client doesn't exist, or hasn't reclaimed its state then it will remove that dentry. This is pretty inefficient since it involves doing a lot of hash-bucket searching. It also means that we have to keep relying on being able to search for a nfs4_client by md5 hashed cl_recdir name. Instead, add a pointer to the nfs4_client that indicates the association between the nfs4_client_reclaim and nfs4_client. When a reclaim operation comes in, we set the pointer to make that association. On gracedone, the legacy client tracker will keep the recdir around iff: 1/ there is a reclaim record for the directory ...and... 2/ there's an association between the reclaim record and a client record -- that is, a create or check operation was performed on the client that matches that directory. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: make nfs4_client_to_reclaim return a pointer to the reclaim recordJeff Layton
Later callers will need to make changes to the record. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: break out reclaim record removal into separate functionJeff Layton
We'll need to be able to call this from nfs4recover.c eventually. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: have nfsd4_find_reclaim_client take a char * argumentJeff Layton
Currently, it takes a client pointer, but later we're going to need to search for these records without knowing whether a matching client even exists. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: warn about impending removal of nfsdcld upcallJeff Layton
Let's shoot for removing the nfsdcld upcall in 3.10. Most likely, no one is actually using it so I don't expect this warning to fire often (except maybe on misconfigured systems). Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: pass info about the legacy recoverydir in environment variablesJeff Layton
The usermodehelper upcall program can then decide to use this info as a (one-way) transition mechanism to the new scheme. When a "check" upcall occurs and the client doesn't exist in the database, we can look to see whether the directory exists. If it does, then we'd add the client to the database, remove the legacy recdir, and return success to the kernel to allow the recovery to proceed. For gracedone, we simply pass the v4recovery "topdir" so that the upcall can clean it out prior to returning to the kernel. A module parm is also added to disable the legacy conversion if the admin chooses. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: change heuristic for selecting the client_tracking_opsJeff Layton
First, try to use the new usermodehelper upcall. It should succeed or fail quickly, so there's little cost to doing so. If it fails, and the legacy tracking dir exists, use that. If it doesn't exist then fall back to using nfsdcld. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12nfsd: add a usermodehelper upcall for NFSv4 client ID trackingJeff Layton
Add a new client tracker upcall type that uses call_usermodehelper to call out to a program. This seems to be the preferred method of calling out to usermode these days for seldom-called upcalls. It's simple and doesn't require a running daemon, so it should "just work" as long as the binary is installed. The client tracking exit operation is also changed to check for a NULL pointer before running. The UMH upcall doesn't need to do anything at module teardown time. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-12kill bogus BUG_ON() in do_close_on_exec()Al Viro
It can be legitimately triggered via procfs access. Now, at least 2 of 3 of get_files_struct() callers in procfs are useless, but when and if we get rid of those we can always add WARN_ON() here. BUG_ON() at that spot is simply wrong. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-10ext4: do not use ext4_error() when there is no space in dir leaf for csumTheodore Ts'o
If there is no space for a checksum in a directory leaf node, previously we would use EXT4_ERROR_INODE() which would mark the file system as inconsistent. While it would be nice to use e2fsck -D, it certainly isn't required, so just print a warning using ext4_warning(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
2012-11-10nfsd: remove unused argument to nfs4_has_reclaimed_stateJeff Layton
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-10nfsd: fix error handling in nfsd4_remove_clid_dirJeff Layton
If the credential save fails, then we'll leak our mnt_want_write_file reference. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-10Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Jeff Layton. * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: Do not lookup hashed negative dentry in cifs_atomic_open cifs: fix potential buffer overrun in cifs.idmap handling code
2012-11-09jffs2: Fix lock acquisition order bug in jffs2_write_beginThomas Betker
jffs2_write_begin() first acquires the page lock, then f->sem. This causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first acquires f->sem, then the page lock: jffs2_garbage_collect_live mutex_lock(&f->sem) (A) jffs2_garbage_collect_dnode jffs2_gc_fetch_page read_cache_page_async do_read_cache_page lock_page(page) (B) jffs2_write_begin grab_cache_page_write_begin find_lock_page lock_page(page) (B) mutex_lock(&f->sem) (A) We fix this by restructuring jffs2_write_begin() to take f->sem before the page lock. However, we make sure that f->sem is not held when calling jffs2_reserve_space(), as this is not permitted by the locking rules. The deadlock above was observed multiple times on an SoC with a dual ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred when using scp to copy files from a host system to the ARM target system. The fix was heavily tested on the same target system. Cc: stable@vger.kernel.org Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-11-09Merge branch 'akpm' (Fixes from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Five fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (5 patches) h8300: add missing L1_CACHE_SHIFT mm: bugfix: set current->reclaim_state to NULL while returning from kswapd() fanotify: fix missing break revert "epoll: support for disabling items, and a self-test app" checkpatch: improve network block comment style checking
2012-11-09Merge tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs bugfixes from Ben Myers: - fix for large transactions spanning multiple iclog buffers - zero the allocation_args structure on the stack before using it to determine whether to use a worker for allocation - move allocation stack switch to xfs_bmapi_allocate in order to prevent deadlock on AGF buffers - growfs no longer reads in garbage for new secondary superblocks - silence a build warning - ensure that invalid buffers never get written to disk while on free list - don't vmap inode cluster buffers during free - fix buffer shutdown reference count mismatch - fix reading of wrapped log data * tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs: xfs: fix reading of wrapped log data xfs: fix buffer shudown reference count mismatch xfs: don't vmap inode cluster buffers during free xfs: invalidate allocbt blocks moved to the free list xfs: silence uninitialised f.file warning. xfs: growfs: don't read garbage for new secondary superblocks xfs: move allocation stack switch up to xfs_bmapi_allocate xfs: introduce XFS_BMAPI_STACK_SWITCH xfs: zero allocation_args on the kernel stack xfs: only update the last_sync_lsn when a transaction completes
2012-11-09fanotify: fix missing breakEric Paris
Anders Blomdell noted in 2010 that Fanotify lost events and provided a test case. Eric Paris confirmed it was a bug and posted a fix to the list https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/RrJfTfyW2BE but never applied it. Repeated attempts over time to actually get him to apply it have never had a reply from anyone who has raised it So apply it anyway Signed-off-by: Alan Cox <alan@linux.intel.com> Reported-by: Anders Blomdell <anders.blomdell@control.lth.se> Cc: Eric Paris <eparis@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-09revert "epoll: support for disabling items, and a self-test app"Andrew Morton
Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a self-test app") pending resolution of the issues identified by Michael Kerrisk, copied below. We'll revisit this for 3.8. : I've taken a look at this patch as it currently stands in 3.7-rc1, and : done a bit of testing. (By the way, the test program : tools/testing/selftests/epoll/test_epoll.c does not compile...) : : There are one or two places where the behavior seems a little strange, : so I have a question or two at the end of this mail. But other than : that, I want to check my understanding so that the interface can be : correctly documented. : : Just to go though my understanding, the problem is the following : scenario in a multithreaded application: : : 1. Multiple threads are performing epoll_wait() operations, : and maintaining a user-space cache that contains information : corresponding to each file descriptor being monitored by : epoll_wait(). : : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL) : a file descriptor from the epoll interest list, and : delete the corresponding record from the user-space cache. : : 3. The problem with (2) is that some other thread may have : previously done an epoll_wait() that retrieved information : about the fd in question, and may be in the middle of using : information in the cache that relates to that fd. Thus, : there is a potential race. : : 4. The race can't solved purely in user space, because doing : so would require applying a mutex across the epoll_wait() : call, which would of course blow thread concurrency. : : Right? : : Your solution is the EPOLL_CTL_DISABLE operation. I want to : confirm my understanding about how to use this flag, since : the description that has accompanied the patches so far : has been a bit sparse : : 0. In the scenario you're concerned about, deleting a file : descriptor means (safely) doing the following: : (a) Deleting the file descriptor from the epoll interest list : using EPOLL_CTL_DEL : (b) Deleting the corresponding record in the user-space cache : : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in : conjunction with EPOLLONESHOT. : : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in : conjunction is a logical error. : : 3. The correct way to code multithreaded applications using : EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows: : : a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should : should EPOLLONESHOT. : : b. When a thread wants to delete a file descriptor, it : should do the following: : : [1] Call epoll_ctl(EPOLL_CTL_DISABLE) : [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE) : was zero, then the file descriptor can be safely : deleted by the thread that made this call. : [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY, : then the descriptor is in use. In this case, the calling : thread should set a flag in the user-space cache to : indicate that the thread that is using the descriptor : should perform the deletion operation. : : Is all of the above correct? : : The implementation depends on checking on whether : (events & ~EP_PRIVATE_BITS) == 0 : This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be : cleared. : : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE : is only useful in conjunction with EPOLLONESHOT. However, as things : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does : not have EPOLLONESHOT set in 'events' This results in the following : (slightly surprising) behavior: : : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0 : (the indicator that the file descriptor can be safely deleted). : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY. : : This doesn't seem particularly useful, and in fact is probably an : indication that the user made a logic error: they should only be using : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which : EPOLLONESHOT was set in 'events'. If that is correct, then would it : not make sense to return an error to user space for this case? Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-08ext4: introduce lseek SEEK_DATA/SEEK_HOLE supportZheng Liu
This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped and extent-mapped files are fully implemented together because ext4_map_blocks hides this differences. After applying this patch, it will cause a failure in xfstest #285 when the file is block-mapped due to block-mapped file isn't support fallocate(2). I had tried to use ext4_ext_walk_space() to retrieve the offset for a extent-mapped file. But finally I decide to keep using ext4_map_blocks() to support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference between block-mapped file and extent-mapped file. Moreover, in next step, extent status tree will track all extent status, and we can get all mappings from this tree. So I think that using ext4_map_blocks() is a better choice. CC: Hugh Dickins <hughd@google.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: reimplement fiemap using extent status treeZheng Liu
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: reimplement ext4_find_delay_alloc_range on extent status treeZheng Liu
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: add some tracepoints in extent status treeZheng Liu
This patch adds some tracepoints in extent status tree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: let ext4 maintain extent status treeZheng Liu
This patch lets ext4 maintain extent status tree. Currently it only tracks delay extent status in extent status tree. When a delay allocation is issued, the related delay extent will be inserted into extent status tree. When a delay extent is written out or invalidated, it will be removed from this tree. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>