summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2013-04-08ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOTDr. Tilmann Bubeck
Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and associated attributes (like i_blocks, i_size, i_flags, ...) from the specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is typically used to store a boot loader in a secure part of the filesystem, where it can't be changed by a normal user by accident. The data blocks of the previous boot loader will be associated with the given inode. This usercode program is a simple example of the usage: int main(int argc, char *argv[]) { int fd; int err; if ( argc != 2 ) { printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n"); exit(1); } fd = open(argv[1], O_WRONLY); if ( fd < 0 ) { perror("open"); exit(1); } err = ioctl(fd, EXT4_IOC_SWAP_BOOT); if ( err < 0 ) { perror("ioctl"); exit(1); } close(fd); exit(0); } [ Modified by Theodore Ts'o to fix a number of bugs in the original code.] Signed-off-by: Dr. Tilmann Bubeck <t.bubeck@reinform.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-08nfsd4: cleanup handling of nfsv4.0 closed stateid'sJ. Bruce Fields
Closed stateid's are kept around a little while to handle close replays in the 4.0 case. So we stash them in the last-used stateid in the oo_last_closed_stateid field of the open owner. We can free that in encode_seqid_op_tail once the seqid on the open owner is next incremented. But we don't want to do that on the close itself; so we set NFS4_OO_PURGE_CLOSE flag set on the open owner, skip freeing it the first time through encode_seqid_op_tail, then when we see that flag set next time we free it. This is unnecessarily baroque. Instead, just move the logic that increments the seqid out of the xdr code and into the operation code itself. The justification given for the current placement is that we need to wait till the last minute to be sure we know whether the status is a sequence-id-mutating error or not, but examination of the code shows that can't actually happen. Reported-by: Yanchuan Nian <ycnian@gmail.com> Tested-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-08GFS2: replace gfs2_ail structure with gfs2_transBenjamin Marzinski
In order to allow transactions and log flushes to happen at the same time, gfs2 needs to move the transaction accounting and active items list code into the gfs2_trans structure. As a first step toward this, this patch removes the gfs2_ail structure, and handles the active items list in the gfs_trans structure. This keeps gfs2 from allocating an ail structure on log flushes, and gives us a struture that can later be used to store the transaction accounting outside of the gfs2 superblock structure. With this patch, at the end of a transaction, gfs2 will add the gfs2_trans structure to the superblock if there is not one already. This structure now has the active items fields that were previously in gfs2_ail. This is not necessary in the case where the transaction was simply used to add revokes, since these are never written outside of the journal, and thus, don't need an active items list. Also, in order to make sure that the transaction structure is not removed while it's still in use by gfs2_trans_end, unlocking the sd_log_flush_lock has to happen slightly later in ending the transaction. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-08GFS2: Remove vestigial parameter ip from function rs_deltreeBob Peterson
The functions that delete block reservations from the rgrp block reservations rbtree no longer use the ip parameter. This patch eliminates the parameter. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-08GFS2: Use gfs2_dinode_out() in the inode create pathSteven Whitehouse
Over the previous two patches relating to inode creation, the content of init_dinode() has been looking more and more like gfs2_dinode_out(). This is not an accident! This patch replaces the parts of init_dinode() which are duplicated in gfs2_dinode_out() with a call to that function. Mostly that is straightforward, but there is one issue which needed to be resolved relating to the link count. The link count has to be set to zero in a certain error handling code path, which lands up calling iput(). This is now done specifically in that code path allowing the link count to be set earlier and written into the on disk inode by gfs2_dinode_put() in the normal way. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-08GFS2: Remove gfs2_refresh_inode from inode creation pathSteven Whitehouse
The original method for creating inodes used in GFS2 was to fill out a buffer, with all the information, and then to read that buffer into the in-core inode, using gfs2_refresh_inode() The problem with this approach is that all the inode's fields need to be calculated ahead of time, and were stored in various variables making the code rather complicated. The new approach is simply to allocate the in-core inode earlier and fill in as many fields as possible ahead of time. These can then be used to initilise the on disk representation. The code has been working towards the point where it is possible to remove gfs2_refresh_inode() because all the fields are correctly initialised ahead of time. We've now reached that milestone, and have reversed the order of setting up the in core and on disk inodes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-08GFS2: Clean up inode creation pathSteven Whitehouse
This patch cleans up the inode creation code path in GFS2. After the Orlov allocator was merged, a number of potential improvements are now possible, and this is a first set of these. The quota handling is now updated so that it matches the point in the code where the allocation takes place. This means that the one exception in gfs2_alloc_blocks relating to quota is now no longer required, and we can use the generic code everywhere. In addition the call to figure out whether we need to allocate any extra blocks in order to add a directory entry is moved higher up gfs2_create_inode. This means that if it returns an error, we can deal with that at a stage where it is easier to handle that case. The returned status cannot change during the function since we hold an exclusive lock on the directory. Two calls to gfs2_rindex_update have been changed to one, again at the top of gfs2_create_inode to simplify error handling. The time stamps are also now initialised earlier in the creation process, this is gradually moving towards being able to remove the call to gfs2_refresh_inode in gfs2_inode_create once we have all the fields covered. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-05sysfs: check if one entry has been removed before freeingMing Lei
It might be a kernel disaster if one sysfs entry is freed but still referenced by sysfs tree. Recently Dave and Sasha reported one use-after-free problem on sysfs entry, and the problem has been troubleshooted with help of debug message added in this patch. Given sysfs_get_dirent/sysfs_put are exported APIs, even inside sysfs they are called in many contexts(kobject/attribe add/delete, inode init/drop, dentry lookup/release, readdir, ...), it is healthful to check the removed flag before freeing one entry and dump message if it is freeing without being removed first. Cc: Dave Jones <davej@redhat.com> Cc: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05NFSv4: Fix CB_RECALL_ANY to only return delegations that are not in useTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Clean up nfs_expire_all_delegationsTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Fix nfs_server_return_all_delegationsTrond Myklebust
If the state manager thread is already running, we may end up racing with it in nfs_client_return_marked_delegations. Better to just allow the state manager thread to do the job. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Be less aggressive about returning delegations for open filesTrond Myklebust
Currently, if the application that holds the file open isn't doing I/O, we may end up returning the delegation. This means that we can no longer cache the file as aggressively, and often also that we multiply the state that both the server and the client needs to track. This patch adds a check for open files to the routine that scans for delegations that are unreferenced. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Clean up delegation recall error handlingTrond Myklebust
Unify the error handling in nfs4_open_delegation_recall and nfs4_lock_delegation_recall. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Clean up nfs4_open_delegation_recallTrond Myklebust
Make it symmetric with nfs4_lock_delegation_recall Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Clean up nfs4_lock_delegation_recallTrond Myklebust
All error cases are handled by the switch() statement, meaning that the call to nfs4_handle_exception() is unreachable. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4: Handle NFS4ERR_DELAY and NFS4ERR_GRACE in nfs4_open_delegation_recallTrond Myklebust
A server shouldn't normally return NFS4ERR_GRACE if the client holds a delegation, since no conflicting lock reclaims can be granted, however the spec does not require the server to grant the open in this instance Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
2013-04-05NFSv4: Handle NFS4ERR_DELAY and NFS4ERR_GRACE in nfs4_lock_delegation_recallTrond Myklebust
A server shouldn't normally return NFS4ERR_GRACE if the client holds a delegation, since no conflicting lock reclaims can be granted, however the spec does not require the server to grant the lock in this instance. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
2013-04-05nfs: allow the v4.1 callback thread to freezeJeff Layton
The v4.1 callback thread has set_freezable() at the top, but it doesn't ever try to freeze within the loop. Have it call try_to_freeze() at the top of the loop. If a freeze event occurs, recheck kthread_should_stop() after thawing. Reported-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_listTrond Myklebust
It is unsafe to use list_for_each_entry_safe() here, because when we drop the nn->nfs_client_lock, we pin the _current_ list entry and ensure that it stays in the list, but we don't do the same for the _next_ list entry. Use of list_for_each_entry() is therefore the correct thing to do. Also fix the refcounting in nfs41_walk_client_list(). Finally, ensure that the nfs_client has finished being initialised and, in the case of NFSv4.1, that the session is set up. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Bryan Schumaker <bjschuma@netapp.com> Cc: stable@vger.kernel.org [>= 3.7]
2013-04-05NFSv4: Fix a memory leak in nfs4_discover_server_trunkingTrond Myklebust
When we assign a new rpc_client to clp->cl_rpcclient, we need to destroy the old one. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org [>=3.7]
2013-04-05NFSv4: Don't clear the machine cred when client establish returns EACCESTrond Myklebust
The expected behaviour is that the client will decide at mount time whether or not to use a krb5i machine cred, or AUTH_NULL. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Bryan Schumaker <bjschuma@netapp.com>
2013-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixesLinus Torvalds
Pull GFS2 fixes from Steven Whitehouse: "There are two patches which fix up a couple of minor issues in the DLM interface code, a missing error path in gfs2_rs_alloc(), one patch which fixes a problem during "withdraw" and a fix for discards/FITRIM when using 4k sector sized devices." * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes: GFS2: Issue discards in 512b sectors GFS2: Fix unlock of fcntl locks during withdrawn state GFS2: return error if malloc failed in gfs2_rs_alloc() GFS2: use memchr_inv GFS2: use kmalloc for lvb bitmap
2013-04-05xfs: don't free EFIs before the EFDs are committedDave Chinner
Filesystems are occasionally being shut down with this error: xfs_trans_ail_delete_bulk: attempting to delete a log item that is not in the AIL. It was diagnosed to be related to the EFI/EFD commit order when the EFI and EFD are in different checkpoints and the EFD is committed before the EFI here: http://oss.sgi.com/archives/xfs/2013-01/msg00082.html The real problem is that a single bit cannot fully describe the states that the EFI/EFD processing can be in. These completion states are: EFI EFI in AIL EFD Result committed/unpinned Yes committed OK committed/pinned No committed Shutdown uncommitted No committed Shutdown Note that the "result" field is what should happen, not what does happen. The current logic is broken and handles the first two cases correctly by luck. That is, the code will free the EFI if the XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The inverted logic "works" because if both EFI and EFD are committed, then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED bit, and the second frees the EFI item. Hence as long as xfs_efi_item_committed() has been called, everything appears to be fine. It is the third case where the logic fails - where xfs_efd_item_committed() is called before xfs_efi_item_committed(), and that results in the EFI being freed before it has been committed. That is the bug that triggered the shutdown, and hence keeping track of whether the EFI has been committed or not is insufficient to correctly order the EFI/EFD operations w.r.t. the AIL. What we really want is this: the EFI is always placed into the AIL before the last reference goes away. The only way to guarantee that is that the EFI is not freed until after it has been unpinned *and* the EFD has been committed. That is, restructure the logic so that the only case that can occur is the first case. This can be done easily by replacing the XFS_EFI_COMMITTED with an EFI reference count. The EFI is initialised with it's own count, and that is not released until it is unpinned. However, there is a complication to this method - the high level EFI/EFD code in xfs_bmap_finish() does not hold direct references to the EFI structure, and runs a transaction commit between the EFI and EFD processing. Hence the EFI can be freed even before the EFD is created using such a method. Further, log recovery uses the AIL for tracking EFI/EFDs that need to be recovered, but it uses the AIL *differently* to the EFI transaction commit. Hence log recovery never pins or unpins EFIs, so we can't drop the EFI reference count indirectly to free the EFI. However, this doesn't prevent us from using a reference count here. There is a 1:1 relationship between EFIs and EFDs, so when we initialise the EFI we can take a reference count for the EFD as well. This solves the xfs_bmap_finish() issue - the EFI will never be freed until the EFD is processed. In terms of log recovery, during the committing of the EFD we can look for the XFS_EFI_RECOVERED bit being set and drop the EFI reference as well, thereby ensuring everything works correctly there as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-04-05NFSv4: Fix issues in nfs4_discover_server_trunkingTrond Myklebust
- Ensure that we exit with ENOENT if the call to ops->get_clid_cred() fails. - Handle the case where ops->detect_trunking() exits with an unexpected error, and return EIO. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-05GFS2: Issue discards in 512b sectorsBob Peterson
This patch changes GFS2's discard issuing code so that it calls function sb_issue_discard rather than blkdev_issue_discard. The code was calling blkdev_issue_discard and specifying the correct sector offset and sector size, but blkdev_issue_discard expects these values to be in terms of 512 byte sectors, even if the native sector size for the device is different. Calling sb_issue_discard with the BLOCK size instead ensures the correct block-to-512b-sector translation. I verified that "minlen" is specified in blocks, so comparing it to a number of blocks is correct. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-04NFSv4: Fix the fallback to AUTH_NULL if krb5i is not availableTrond Myklebust
If the rpcsec_gss_krb5 module cannot be loaded, the attempt to create an rpc_client in nfs4_init_client will currently fail with an EINVAL. Fix is to retry with AUTH_NULL. Regression introduced by the commit "NFS: Use "krb5i" to establish NFSv4 state whenever possible" Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Bryan Schumaker <bjschuma@netapp.com>
2013-04-04NFS: Use server-recommended security flavor by default (NFSv3)Chuck Lever
Since commit ec88f28d in 2009, checking if the user-specified flavor is in the server's flavor list has been the source of a few noticeable regressions (now fixed), but there is one that is still vexing. An NFS server can list AUTH_NULL in its flavor list, which suggests a client should try to mount the server with the flavor of the client's choice, but the server will squash all accesses. In some cases, our client fails to mount a server because of this check, when the mount could have proceeded successfully. Skip this check if the user has specified "sec=" on the mount command line. But do consult the server-provided flavor list to choose a security flavor if no sec= option is specified on the mount command. If a server lists Kerberos pseudoflavors before "sys" in its export options, our client now chooses Kerberos over AUTH_UNIX for mount points, when no security flavor is specified by the mount command. This could be surprising to some administrators or users, who would then need to have Kerberos credentials to access the export. Or, a client administrator may not have enabled rpc.gssd. In this case, auth_rpcgss.ko might still be loadable, which is enough for the new logic to choose Kerberos over AUTH_UNIX. But the mount would fail since no GSS context can be created without rpc.gssd running. To retain the use of AUTH_UNIX by default: o The server administrator can ensure that "sys" is listed before Kerberos flavors in its export security options (see exports(5)), o The client administrator can explicitly specify "sec=sys" on its mount command line (see nfs(5)), o The client administrator can use "Sec=sys" in an appropriate section of /etc/nfsmount.conf (see nfsmount.conf(5)), or o The client administrator can blacklist auth_rpcgss.ko. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-04nfsd4: remove unused nfs4_check_deleg argumentJ. Bruce Fields
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-04nfsd4: make del_recall_lru per-network-namespaceJ. Bruce Fields
If nothing else this simplifies the nfs4_state_shutdown_net logic a tad. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-04nfsd4: shut down more of delegation earlierJ. Bruce Fields
Once we've unhashed the delegation, it's only hanging around for the benefit of an oustanding recall, which only needs the encoded filehandle, stateid, and dl_retries counter. No point keeping the file around any longer, or keeping it hashed. This also fixes a race: calls to idr_remove should really be serialized by the caller, but the nfs4_put_delegation call from the callback code isn't taking the state lock. (Better might be to cancel the callback before destroying the delegation, and remove any need for reference counting--but I don't see an easy way to cancel an rpc call.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-04nfsd4: minor cb_recall simplificationJ. Bruce Fields
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-04-04Merge tag 'upstream-3.9-rc6' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBIFS fix from Artem Bityutskiy: "Make the space fixup feature work in the case when the file-system is first mounted R/O and then remounted R/W." * tag 'upstream-3.9-rc6' of git://git.infradead.org/linux-ubifs: UBIFS: make space fixup work in the remount case
2013-04-04GFS2: Fix unlock of fcntl locks during withdrawn stateSteven Whitehouse
When withdraw occurs, we need to continue to allow unlocks of fcntl locks to occur, however these will only be local, since the node has withdrawn from the cluster. This prevents triggering a VFS level bug trap due to locks remaining when a file is closed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-04GFS2: return error if malloc failed in gfs2_rs_alloc()Wei Yongjun
The error code in gfs2_rs_alloc() is set to ENOMEM when error but never be used, instead, gfs2_rs_alloc() always return 0. Fix to return 'error'. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-04GFS2: use memchr_invAkinobu Mita
Use memchr_inv to verify that the specified memory range is cleared. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com>
2013-04-04GFS2: use kmalloc for lvb bitmapDavid Teigland
The temp lvb bitmap was on the stack, which could be an alignment problem for __set_bit_le. Use kmalloc for it instead. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-04-03pstore/ram: Restore ecc information blockArve Hjønnevåg
This was lost when proc/last_kmsg moved to pstore/console-ramoops. Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Anton Vorontsov <anton@enomsg.org>
2013-04-03pstore/ram: Allow specifying ecc parameters in platform dataArve Hjønnevåg
Allow specifying ecc parameters in platform data Signed-off-by: Arve Hjønnevåg <arve@android.com> [jstultz: Tweaked commit subject & add commit message] Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Anton Vorontsov <anton@enomsg.org>
2013-04-03pstore/ram: Include ecc_size when calculating ecc_blockArve Hjønnevåg
Wastes less memory and allows using more memory for ecc than data. Signed-off-by: Arve Hjønnevåg <arve@android.com> [jstultz: Tweaked commit subject] Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Anton Vorontsov <anton@enomsg.org>
2013-04-03ext4: print more info in ext4_print_free_blocks()Lukas Czerner
Additionally print i_allocated_meta_blocks information as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: try to prepend extent to the existing oneLukas Czerner
Currently when inserting extent in ext4_ext_insert_extent() we would only try to to see if we can append new extent to the found extent. If we can not, then we proceed with adding new extent into the extent tree, but then possibly merging it back again. We can avoid this situation by trying to append and prepend new extent to the existing ones. However since the new extent can be on either sides of the existing extent, we have to pick the right extent to try to append/prepend to. This patch adds the conditions to pick the right extent to append/prepend to and adds the actual prepending condition as well. This will also eliminate the need to use "reserved" block for possibly growing extent tree. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: Transfer initialized block to right neighbor if possibleLukas Czerner
Currently when converting extent to initialized we attempt to transfer initialized block to the left neighbour if possible when certain criteria are met. However we do not attempt to do the same for the right neighbor. This commit adds the possibility to transfer initialized block to the right neighbour if: 1. We're not converting the whole extent 2. Both extents are stored in the same extent tree node 3. Right neighbor is initialized 4. Right neighbor is logically abutting the current one 5. Right neighbor is physically abutting the current one 6. Right neighbor would not overflow the length limit This is basically the same logic as with transferring to the left. This will gain us some performance benefits since it is faster than inserting extent and then merging it. It would also prevent some situation in delalloc patch when we might run out of metadata reservation. This is due to the fact that we would attempt to split the extent first (possibly allocating new metadata block) even though we did not counted for that because it can (and will) be merged again. This commit fix that scenario, because we no longer need to split the extent in such case. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: introduce ext4_get_group_number()Lukas Czerner
Currently on many places in ext4 we're using ext4_get_group_no_and_offset() even though we're only interested in knowing the block group of the particular block, not the offset within the block group so we can use more efficient way to compute block group. This patch introduces ext4_get_group_number() which computes block group for a given block much more efficiently. Use this function instead of ext4_get_group_no_and_offset() everywhere where we're only interested in knowing the block group. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: make ext4_block_in_group() much more efficientLukas Czerner
Currently in when getting the block group number for a particular block in ext4_block_in_group() we're using ext4_get_group_no_and_offset() which uses do_div() to get the block group and the remainer which is offset within the group. We don't need all of that in ext4_block_in_group() as we only need to figure out the group number. This commit changes ext4_block_in_group() to calculate group number directly. This shows as a big improvement with regards to cpu utilization. Measuring fallocate -l 15T on fresh file system with perf showed that 23% of cpu time was spend in the ext4_get_group_no_and_offset(). With this change it completely disappears from the list only bumping the occurrence of ext4_init_block_bitmap() which is the biggest user of ext4_block_in_group() by 4%. As the result of this change on my system the fallocate call was approx. 10% faster. However since there is '-g' option in mkfs which allow us setting different groups size (mostly for developers) I've introduced new per file system flag whether we have a standard block group size or not. The flag is used to determine whether we can use the bit shift optimization or not. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: unregister es_shrinker if mount failedDmitry Monakhov
Otherwise destroyed ext_sb_info will be part of global shinker list and result in the following OOPS: JBD2: corrupted journal superblock JBD2: recovery failed EXT4-fs (dm-2): error loading journal general protection fault: 0000 [#1] SMP Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\ mod CPU 1 Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136 /DH55TC RIP: 0010:[<ffffffff811bfb2d>] [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP: 0000:ffff88011d5cbcd8 EFLAGS: 00010207 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246 RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848 R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000 FS: 00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80) Stack: ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1 00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000 ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000 Call Trace: [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0 [<ffffffff81249381>] mount_bdev+0x331/0x340 [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180 [<ffffffff81362035>] ext4_mount+0x15/0x20 [<ffffffff8124869a>] mount_fs+0x9a/0x2e0 [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170 [<ffffffff81279c02>] do_new_mount+0x172/0x2e0 [<ffffffff8127aa56>] do_mount+0x376/0x380 [<ffffffff8127ab98>] sys_mount+0x138/0x150 [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\ 8 RIP [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP <ffff88011d5cbcd8> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-03ext4: fix journal callback list traversalDmitry Monakhov
It is incorrect to use list_for_each_entry_safe() for journal callback traversial because ->next may be removed by other task: ->ext4_mb_free_metadata() ->ext4_mb_free_metadata() ->ext4_journal_callback_del() This results in the following issue: WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 This patch fix the issue as follows: - ext4_journal_commit_callback() make list truly traversial safe simply by always starting from list_head - fix race between two ext4_journal_callback_del() and ext4_journal_callback_try_del() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.com
2013-04-03jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callbackDmitry Monakhov
The following race is possible: [kjournald2] other_task jbd2_journal_commit_transaction() j_state = T_FINISHED; spin_unlock(&journal->j_list_lock); ->jbd2_journal_remove_checkpoint() ->jbd2_journal_free_transaction(); ->kmem_cache_free(transaction) ->j_commit_callback(journal, transaction); -> USE_AFTER_FREE WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 In order to demonstrace this issue one should mount ext4 with mount -o discard option on SSD disk. This makes callback longer and race window becomes wider. In order to fix this we should mark transaction as finished only after callbacks have completed Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-03ext4: support simple conversion of extent-mapped inodes to use i_blocksTheodore Ts'o
In order to make it simpler to test the code which support i_blocks/indirect-mapped inodes, support the conversion of inodes which are less than 12 blocks and which are contained in no more than a single extent. The primary intended use of this code is to converting freshly created zero-length files and empty directories. Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a check that prevents the clearing of the extent flag. A simple patch which allows "chattr -e <file>" to work will be checked into the e2fsprogs git repository. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4/jbd2: don't wait (forever) for stale tid caused by wraparoundTheodore Ts'o
In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> Cc: stable@vger.kernel.org
2013-04-03ext4: add might_sleep() annotationsTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>