path: root/fs
Age | Commit message | Author
2012-01-06 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  sched/tracing: Add a new tracepoint for sleeptime
  sched: Disable scheduler warnings during oopses
  sched: Fix cgroup movement of waking process
  sched: Fix cgroup movement of newly created process
  sched: Fix cgroup movement of forking process
  sched: Remove cfs bandwidth period check in tg_set_cfs_period()
  sched: Fix load-balance lock-breaking
  sched: Replace all_pinned with a generic flags field
  sched: Only queue remote wakeups when crossing cache boundaries
  sched: Add missing rcu_dereference() around ->real_parent usage
  [S390] fix cputime overflow in uptime_proc_show
  [S390] cputime: add sparse checking and cleanup
  sched: Mark parent and real_parent as __rcu
  sched, nohz: Fix missing RCU read lock
  sched, nohz: Set the NOHZ_BALANCE_KICK flag for idle load balancer
  sched, nohz: Fix the idle cpu check in nohz_idle_balance
  sched: Use jump_labels for sched_feat
  sched/accounting: Fix parameter passing in task_group_account_field
  sched/accounting: Fix user/system tick double accounting
  sched/accounting: Re-use scheduler statistics for the root cgroup
  ...
Fix up conflicts in
 - arch/ia64/include/asm/cputime.h, include/asm-generic/cputime.h
   usecs_to_cputime64() vs the sparse cleanups
 - kernel/sched/fair.c, kernel/time/tick-sched.c
   scheduler changes in multiple branches
2012-01-06 | fs/9p: iattr_valid flags are kernel internal flags map them to 9p values. | Aneesh Kumar K.V
Kernel internal values can change, so add protocol values for these constants and use them. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
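The mapping idea can be sketched as an explicit translation table. This is a minimal userspace sketch, not the v9fs code: the `KERN_ATTR_*` values are hypothetical and deliberately shuffled to show why copying the internal mask onto the wire would be wrong, and the `P9_ATTR_*` values are assumed to follow the 9P2000.L setattr layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel-internal flags; their values may change between releases. */
#define KERN_ATTR_SIZE  (1u << 0)
#define KERN_ATTR_MODE  (1u << 1)
#define KERN_ATTR_UID   (1u << 2)
#define KERN_ATTR_GID   (1u << 3)
#define KERN_ATTR_MTIME (1u << 4)
#define KERN_ATTR_ATIME (1u << 5)

/* Wire values, fixed by the protocol (assumed 9P2000.L setattr layout). */
#define P9_ATTR_MODE  (1u << 0)
#define P9_ATTR_UID   (1u << 1)
#define P9_ATTR_GID   (1u << 2)
#define P9_ATTR_SIZE  (1u << 3)
#define P9_ATTR_ATIME (1u << 4)
#define P9_ATTR_MTIME (1u << 5)

static const struct {
	uint32_t kern;
	uint32_t wire;
} attr_map[] = {
	{ KERN_ATTR_MODE,  P9_ATTR_MODE },
	{ KERN_ATTR_UID,   P9_ATTR_UID },
	{ KERN_ATTR_GID,   P9_ATTR_GID },
	{ KERN_ATTR_SIZE,  P9_ATTR_SIZE },
	{ KERN_ATTR_ATIME, P9_ATTR_ATIME },
	{ KERN_ATTR_MTIME, P9_ATTR_MTIME },
};

/* Translate an internal valid-mask into the on-the-wire mask bit by bit. */
static uint32_t p9_wire_valid(uint32_t kern_valid)
{
	uint32_t wire = 0;

	for (size_t i = 0; i < sizeof(attr_map) / sizeof(attr_map[0]); i++)
		if (kern_valid & attr_map[i].kern)
			wire |= attr_map[i].wire;
	return wire;
}
```

With the table in place, a change to the internal bit layout only requires updating the `kern` column; the wire format stays stable.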
2012-01-06 | ore: fix BUG_ON, too few sgs when reading | Boaz Harrosh
When reading RAID5 files, in rare cases, we calculated too few sg segments. There should be two extra for the beginning and end partial units. Also, "too few sg segments" should not be a BUG_ON; all the mechanics are in place to handle it as a short read. So just return -ENOMEM, and the rest of the code will gracefully split the IO. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
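The "two extra" segments follow from simple arithmetic. This sketch assumes a byte range split into fixed-size stripe units; the helper names are hypothetical and not part of the ore API. A range that starts and ends mid-unit touches one more unit than `len / unit` suggests, so `len / unit + 2` is a safe upper bound, and running short returns -ENOMEM instead of crashing.

```c
#include <errno.h>
#include <stdint.h>

/* Exact number of stripe units touched by [offset, offset + len), len > 0. */
static uint64_t units_touched(uint64_t offset, uint64_t len, uint64_t unit)
{
	uint64_t first = offset / unit;
	uint64_t last = (offset + len - 1) / unit;

	return last - first + 1;
}

/* Conservative estimate: whole units plus one partial unit at each end. */
static uint64_t sg_estimate(uint64_t len, uint64_t unit)
{
	return len / unit + 2;
}

/* Hypothetical check: report a short read rather than BUG_ON(). */
static int check_sg_budget(uint64_t offset, uint64_t len, uint64_t unit,
			   uint64_t available)
{
	if (units_touched(offset, len, unit) > available)
		return -ENOMEM; /* caller splits the IO */
	return 0;
}
```

For example, 4096 bytes starting at offset 100 with 4096-byte units touches two units, while `len / unit` alone gives one.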
2012-01-06 | ore: Fix crash in case of an IO error. | Boaz Harrosh
The users of ore_check_io() expect the reported device (in case of error) to be indexed relative to the passed-in ore_components table, and not by the logical dev index. This causes a crash inside objlayoutdriver in case of an IO error. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 | ore: FIX breakage when MISC_FILESYSTEMS is not set | Boaz Harrosh
As reported by Randy Dunlap, when MISC_FILESYSTEMS is not enabled and NFS4.1 is:
    fs/built-in.o: In function `objio_alloc_io_state':
    objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
    fs/built-in.o: In function `_write_done':
    objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
    fs/built-in.o: In function `_read_done':
    ...
When MISC_FILESYSTEMS, which is more of a GUI thing than anything else, is not selected, exofs/Kconfig is never examined during Kconfig, and it cannot do its magic to automatically select everything needed. We must split exofs/Kconfig in two: the ore one is always included, and the exofs one is left in its old place in the menu. [Needed for the 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 | NFS: Remove pNFS bloat from the generic write path | Trond Myklebust
We have no business doing any of this in the standard write release path. Get rid of it, and put it in the pNFS layer. Also, while we're at it, get rid of the completely bogus unlock/relock semantics that were present in nfs_writeback_release_full(). Not only is it unnecessary, it is actually dangerous to release the write lock just in order to take it again in nfs_page_async_flush(). Better just to open code the pgio operations in a pnfs helper. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-06 | pnfs-obj: Must return layout on IO error | Boaz Harrosh
As mandated by the standard, in case of an IO error a pNFS objects layout driver must return its layout. This is because all device errors are reported to the server as part of the layout return buffer. This is implemented the same way PNFS_LAYOUTRET_ON_SETATTR is done, through a bit flag on the pnfs_layoutdriver_type->flags member. The flag is set by a layout driver that wants a layout_return performed at pnfs_ld_{write,read}_done in case of an error. (Though I have not defined a wrapper like pnfs_ld_layoutret_on_setattr, because this code is never called outside of pnfs.c and the pnfs IO paths.) Without this patch, 3.[0-2] Kernels leak memory and emit an annoying WARN_ON after every IO error when using the pnfs-obj driver. [This patch is for the 3.2 Kernel. 3.1/0 Kernels need a different patch] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
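The flag-bit mechanism can be sketched with a simplified stand-in for the driver descriptor. `PNFS_LAYOUTRET_ON_SETATTR` is named in the commit; the error flag name `PNFS_LAYOUTRET_ON_ERROR`, the bit values, and the `should_return_layout()` helper are assumptions for illustration only.

```c
#include <stdbool.h>

/* Bit values are illustrative, not taken from the kernel headers. */
enum {
	PNFS_LAYOUTRET_ON_SETATTR = 1 << 0,
	PNFS_LAYOUTRET_ON_ERROR   = 1 << 1, /* assumed name for the new flag */
};

/* Simplified stand-in for the real layout driver descriptor. */
struct pnfs_layoutdriver_type {
	const char *name;
	unsigned int flags;
};

/* Hypothetical check made at read/write completion time. */
static bool should_return_layout(const struct pnfs_layoutdriver_type *ld,
				 int pnfs_error)
{
	return pnfs_error != 0 && (ld->flags & PNFS_LAYOUTRET_ON_ERROR);
}
```

Only drivers that opt in through `flags` trigger a layout return on error; other drivers are unaffected.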
2012-01-06 | pnfs-obj: pNFS errors are communicated on iodata->pnfs_error | Boaz Harrosh
At some point along the way, pNFS IO errors were switched to being communicated through a special iodata->pnfs_error member instead of the regular RPC members, but objlayout was not switched over. Fix that! Without this fix, any IO error hangs, because IO is not switched to the MDS and pages are never cleared or read. [Applies to 3.2.0. Same bug, different patch for 3.1/0 Kernels] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-06 | Merge tag 'v3.2' into staging/for_v3.3 | Mauro Carvalho Chehab
* tag 'v3.2': (83 commits)
  Linux 3.2
  minixfs: misplaced checks lead to dentry leak
  ptrace: ensure JOBCTL_STOP_SIGMASK is not zero after detach
  ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race
  Revert "rtc: Expire alarms after the time is set."
  [CIFS] default ntlmv2 for cifs mount delayed to 3.3
  cifs: fix bad buffer length check in coalesce_t2
  Revert "rtc: Disable the alarm in the hardware"
  hung_task: fix false positive during vfork
  security: Fix security_old_inode_init_security() when CONFIG_SECURITY is not set
  fix CAN MAINTAINERS SCM tree type
  mwifiex: fix crash during simultaneous scan and connect
  b43: fix regression in PIO case
  ath9k: Fix kernel panic in AR2427 in AP mode
  CAN MAINTAINERS update
  net: fsl: fec: fix build for mx23-only kernel
  sch_qfq: fix overflow in qfq_update_start()
  drm/radeon/kms/atom: fix possible segfault in pm setup
  gspca: Fix falling back to lower isoc alt settings
  futex: Fix uninterruptible loop due to gate_area
  ...
2012-01-05 | ptrace: do not audit capability check when outputing /proc/pid/stat | Eric Paris
Reading /proc/pid/stat of another process checks whether the reader has ptrace permission on that process. If it does, the read outputs some data about the process that might have security and attack implications. If the current task does not have ptrace permission, the read still works, but those fields are filled with innocuous (0) values. Since this check and a subsequent denial are not a violation of the security policy, we should not audit such denials. This is quite useful for removing ptrace broadly across a system without flooding the logs whenever ps, or anything else that harmlessly walks /proc, is run. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05 | vfs: fix up ENOIOCTLCMD error handling | Linus Torvalds
We're doing some odd things there, which already messes up various users (see the net/socket.c code that this removes), and it was going to add yet more crud to the block layer because of the incorrect error code translation. ENOIOCTLCMD is not an error return that should be returned to user mode from the "ioctl()" system call, but it should *not* be translated as EINVAL ("Invalid argument"). It should be translated as ENOTTY ("Inappropriate ioctl for device"). That EINVAL confusion has apparently so permeated some code that the block layer actually checks for it, which is sad. We continue to do so for now, but add a big comment about how wrong that is, and we should remove it entirely eventually. In the meantime, this tries to keep the changes localized to just the EINVAL -> ENOTTY fix, and removing code that makes it harder to do the right thing. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
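The translation rule can be sketched in userspace C. `ENOIOCTLCMD` is kernel-internal and absent from libc headers, so it is defined locally here with the value assumed from the kernel's `include/linux/errno.h` (515); the helper name is hypothetical.

```c
#include <errno.h>

#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515 /* kernel-internal "no ioctl command"; assumed value */
#endif

/* Map a driver's ioctl return value to what user space should see. */
static long ioctl_errno_for_user(long ret)
{
	if (ret == -ENOIOCTLCMD)
		return -ENOTTY; /* "Inappropriate ioctl for device", not -EINVAL */
	return ret;
}
```

Every other return value, including a genuine -EINVAL from a driver that recognized the command but rejected its argument, passes through unchanged.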
2012-01-05 | nfsd4: nfsd4_create_clid_dir return value is unused | J. Bruce Fields
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-01-05 | NFSD: Change name of extended attribute containing junction | Chuck Lever
As of fedfs-utils-0.8.0, user space stores all NFS junction information in a single extended attribute: "trusted.junction.nfs". Both FedFS and NFS basic junctions are stored in this one attribute, and the intention is that all future forms of NFS junction metadata will be stored in this attribute. Other protocols may use a different extended attribute. Thus NFSD needs to look only for that one extended attribute. The "trusted.junction.type" xattr is deprecated. fedfs-utils-0.8.0 will continue to attach a "trusted.junction.type" xattr to junctions, but future fedfs-utils releases may no longer do that. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-01-05 | nfsd4: be forgiving in the absence of the recovery directory | J. Bruce Fields
If the recovery directory doesn't exist, then behavior after a reboot will be suboptimal. But it's unnecessarily harsh to then prevent the nfsv4 server from working at all. Instead just print a warning (already done in nfsd4_init_recdir()) and soldier on. Tested-by: Lior <lior@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-01-05 | NFS: Cache state owners after files are closed | Chuck Lever
Servers have a finite amount of memory to store NFSv4 open and lock owners. Moreover, servers may have a difficult time determining when they can reap their state owner table, thanks to gray areas in the NFSv4 protocol specification. Thus clients should be careful to reuse state owners when possible. Currently Linux is not too careful. When a user has closed all her files on one mount point, the state owner's reference count goes to zero, and it is released. The next OPEN allocates a new one. A workload that serially opens and closes files can run through a large number of open owners this way. When a state owner's reference count goes to zero, slap it onto a free list for that nfs_server, with an expiry time. Garbage collect before looking for a state owner. This makes state owners for active users available for re-use. Now that there can be unused state owners remaining at umount time, purge the state owner free list when a server is destroyed. Also be sure not to reclaim unused state owners during state recovery. This change has benefits for the client as well. For some workloads, this approach drops the number of OPEN_CONFIRM calls from the same as the number of OPEN calls, down to just one. This reduces wire traffic and thus open(2) latency. Before this patch, untarring a kernel source tarball shows the OPEN_CONFIRM call counter steadily increasing through the test. With the patch, the OPEN_CONFIRM count remains at 1 throughout the entire untar. As long as the expiry time is kept short, I don't think garbage collection should be terribly expensive, although it does bounce the clp->cl_lock around a bit. [ At some point we should rationalize the use of the nfs_server ->destroy method. ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Trond: Fixed a garbage collection race and a few efficiency issues] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
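The free-list-with-expiry scheme can be modeled in a few dozen lines. This is a simplified sketch, not the NFS client code: owners are keyed by a hypothetical integer `cred_id` on a plain list with no locking, whereas the real client uses an RB tree protected by `clp->cl_lock`. It shows the three rules from the commit: park owners on release, garbage collect before lookup, and purge everything at unmount.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical, simplified model of an NFSv4 state owner. */
struct state_owner {
	int cred_id;              /* stand-in for the owning credential */
	int refcount;
	time_t expires;           /* meaningful only while on the free list */
	struct state_owner *next; /* free-list linkage */
};

struct server {
	struct state_owner *free_list;
	int allocations;          /* counts fresh allocations, for illustration */
};

/* Drop expired entries from the free list. */
static void owner_gc(struct server *srv, time_t now)
{
	struct state_owner **pp = &srv->free_list;

	while (*pp) {
		struct state_owner *sp = *pp;

		if (sp->expires <= now) {
			*pp = sp->next;
			free(sp);
		} else {
			pp = &sp->next;
		}
	}
}

/* Garbage-collect first, then reuse a cached owner for this cred if any. */
static struct state_owner *owner_get(struct server *srv, int cred_id, time_t now)
{
	owner_gc(srv, now);
	for (struct state_owner **pp = &srv->free_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->cred_id == cred_id) {
			struct state_owner *sp = *pp;

			*pp = sp->next;
			sp->next = NULL;
			sp->refcount = 1;
			return sp;
		}
	}
	struct state_owner *sp = calloc(1, sizeof(*sp));
	if (!sp)
		return NULL;
	sp->cred_id = cred_id;
	sp->refcount = 1;
	srv->allocations++;
	return sp;
}

/* On the last put, park the owner on the free list instead of freeing it. */
static void owner_put(struct server *srv, struct state_owner *sp,
		      time_t now, time_t ttl)
{
	if (--sp->refcount > 0)
		return;
	sp->expires = now + ttl;
	sp->next = srv->free_list;
	srv->free_list = sp;
}

/* At unmount: purge everything regardless of expiry. */
static void owner_purge(struct server *srv)
{
	while (srv->free_list) {
		struct state_owner *sp = srv->free_list;

		srv->free_list = sp->next;
		free(sp);
	}
}
```

A serial open/close workload for one user then allocates a single owner, which mirrors the commit's observation that OPEN_CONFIRM stays at 1 during an untar.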
2012-01-05 | fs/9p: We should not allocate a new inode when creating hardlines. | Aneesh Kumar K.V
Don't do new_inode_from fid in case of hardlink creation. This ensures that the link count for hardlinked files gets updated properly. Previously, the link count was not updated when removing a hardlink with cache mode enabled. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2012-01-05 | fs/9p: v9fs_stat2inode should update suid/sgid bits. | Aneesh Kumar K.V
Create a new helper that updates the permission bits and use it instead of open-coding the logic. Reported and bisected by: M. Mohan Kumar <mohan@in.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2012-01-05 | 9p: Reduce object size with CONFIG_NET_9P_DEBUG | Joe Perches
Reduce object size by deduplicating formats. Use vsprintf extension %pV. Rename P9_DPRINTK uses to p9_debug, align arguments. Add function for _p9_debug and macro to add __func__. Add missing "\n"s to p9_debug uses. Remove embedded function names, as p9_debug adds them. Remove P9_EPRINTK macro and convert uses to pr_<level>. Add and use pr_fmt and pr_<level>.
    $ size fs/9p/built-in.o*
       text    data     bss     dec     hex filename
      62133     984   16000   79117   1350d fs/9p/built-in.o.new
      67342     984   16928   85254   14d06 fs/9p/built-in.o.old
    $ size net/9p/built-in.o*
       text    data     bss     dec     hex filename
      88792    4148   22024  114964   1c114 net/9p/built-in.o.new
      94072    4148   23232  121452   1da6c net/9p/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
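The "one function plus a `__func__` macro" pattern can be shown with a userspace analog. `%pV` is a printk extension with no libc equivalent, so this sketch formats through `vsnprintf()` instead; the kernel macro also takes a debug level, omitted here, and `v9fs_example()` is a hypothetical call site. Every call site shares one out-of-line formatter, and no format string repeats a function name.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char last_msg[256]; /* kept only so the output can be inspected */

/* One out-of-line function does the formatting for every call site. */
static void _p9_debug(const char *func, const char *fmt, ...)
{
	va_list args;
	size_t used;

	snprintf(last_msg, sizeof(last_msg), "9p: %s: ", func);
	used = strlen(last_msg);
	va_start(args, fmt);
	vsnprintf(last_msg + used, sizeof(last_msg) - used, fmt, args);
	va_end(args);
	fputs(last_msg, stderr);
}

/* The macro supplies __func__, so format strings no longer embed names. */
#define p9_debug(fmt, ...) _p9_debug(__func__, fmt, ##__VA_ARGS__)

/* Hypothetical call site. */
static void v9fs_example(int fid)
{
	p9_debug("fid %d\n", fid);
}
```

The `##__VA_ARGS__` form is a GCC/Clang extension (also used by the kernel) that allows calls with no variadic arguments.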
2012-01-05 | fs/9p: check schedule_timeout_interruptible return value | Jim Garlick
In v9fs_file_do_lock() we need to check the return value of schedule_timeout_interruptible() and exit the loop when it returns nonzero; otherwise the loop is not really interruptible, and after the signal the loop is no longer throttled by P9_LOCK_TIMEOUT. Signed-off-by: Jim Garlick <garlick.jim@gmail.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
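The loop shape can be modeled with simulated stand-ins for the lock attempt and the sleep. Everything here is hypothetical: `sleep_interruptible()` only mimics the documented contract of `schedule_timeout_interruptible()`, returning the remaining time, which is nonzero when a signal cut the sleep short. The fix amounts to treating that nonzero return as "stop retrying".

```c
#include <errno.h>
#include <stdbool.h>

/* Simulated environment for the retry loop. */
struct lock_sim {
	int attempts_until_granted; /* lock is granted after this many tries */
	int signal_after;           /* sleep is interrupted on this attempt (-1: never) */
	int attempts;
};

static bool try_lock(struct lock_sim *s)
{
	return ++s->attempts > s->attempts_until_granted;
}

/* Models schedule_timeout_interruptible(): nonzero means woken early. */
static long sleep_interruptible(struct lock_sim *s, long timeout)
{
	return s->attempts == s->signal_after ? timeout / 2 : 0;
}

/* Retry loop: stop as soon as the sleep reports an interruption. */
static int do_lock(struct lock_sim *s, long timeout)
{
	for (;;) {
		if (try_lock(s))
			return 0;
		if (sleep_interruptible(s, timeout) != 0)
			return -EINTR;
	}
}
```

Without the check, an interrupted sleep returns early on every iteration and the loop spins without the intended throttling.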
2012-01-05 | NFS: Clean up nfs4_find_state_owners_locked() | Chuck Lever
There's no longer a need to check the so_server field in the state owner, because nowadays the RB tree we search for state owners contains owners for only that server. Make nfs4_find_state_owners_locked() use the same tree searching logic as nfs4_insert_state_owner_locked(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05 | NFSv4: include bitmap in nfsv4 get acl data | Andy Adamson
The NFSv4 bitmap size is unbounded: a server can return an arbitrarily sized bitmap in an FATTR4_WORD0_ACL request. Instead of using nfs4_fattr_bitmap_maxsz as a guess at the maximum bitmask returned by a server, include the bitmap (XDR length plus bitmasks) and the ACL data XDR length in the (cached) ACL page data. This is a general solution to commit e5012d1f "NFSv4.1: update nfs4_fattr_bitmap_maxsz" and fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs. Fix a bug in decode_getacl that returned -EINVAL on ACLs > page when getxattr was called with a NULL buffer, which prevented ACLs larger than PAGE_SIZE from being retrieved. Cc: stable@kernel.org Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
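Why a compile-time maximum is the wrong bound becomes clear from the XDR encoding: a bitmap is a 32-bit word count followed by that many 32-bit words. This hypothetical helper, not the kernel's decoder, computes the bitmap's size from its own count word and bounds it only by the received buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit XDR word. */
static uint32_t xdr_word(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Bytes occupied by an XDR bitmap (count word plus words), or -EIO if
 * the buffer is too short. The count comes from the server, so it is
 * checked against the buffer, not against a fixed maximum. */
static long xdr_bitmap_len(const unsigned char *buf, size_t buflen)
{
	if (buflen < 4)
		return -EIO;
	uint32_t nwords = xdr_word(buf);
	if (nwords > (buflen - 4) / 4)
		return -EIO;
	return 4 + 4L * nwords;
}
```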
2012-01-05 | nfs: fix a minor do_div portability issue | Chris Metcalf
This change modifies filelayout_get_dense_offset() to use the functions in math64.h and thus avoid a 32-bit platform compile error trying to use do_div() on an s64 type. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
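The arithmetic behind a dense-offset calculation can be sketched assuming the standard dense stripe-packing rule, with any pattern offset ignored: each data server stores its stripe units back to back, so a file offset maps to (stripe number × unit) + offset within the unit. In userspace, 64-bit `/` and `%` just work. In the kernel on 32-bit platforms, the same divisions must go through math64.h helpers (for example `div_u64()`), which is the portability issue the commit addresses.

```c
#include <stdint.h>

/* Dense packing: offset within a data server's file for a given file offset. */
static uint64_t dense_offset(uint64_t offset, uint32_t stripe_unit,
			     uint32_t stripe_count)
{
	uint64_t stripe_width = (uint64_t)stripe_unit * stripe_count;
	uint64_t stripe_no = offset / stripe_width;
	uint64_t rem = offset % stripe_unit;

	return stripe_no * stripe_unit + rem;
}
```

With 4096-byte units over 4 servers, file offset 16484 (second stripe, 100 bytes into the first server's unit) lands at offset 4196 on that server.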
2012-01-05 | NFSv4.1: cleanup comment and debug printk | Andy Adamson
Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05 | NFSv4.1: change nfs4_free_slot parameters for dynamic slots | Andy Adamson
Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05 | NFSv4.1: cleanup init and reset of session slot tables | Andy Adamson
We are either initializing or resetting a session. Initialize or reset the session slot tables accordingly. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05 | NFSv4.1: fix backchannel slotid off-by-one bug | Andy Adamson
Cc: stable@kernel.org Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05 | nfs: fix regression in handling of context= option in NFSv4 | Jeff Layton
Setting the security context of an NFSv4 mount via the context= mount option is currently broken. The NFSv4 codepath allocates a parsed options struct, and then parses the mount options to fill it. It eventually calls nfs4_remote_mount, which calls security_init_mnt_opts. That clobbers the lsm_opts struct that was populated earlier. This bug also looks like it causes a small memory leak on each v4 mount where context= is used. Fix this by moving the initialization of the lsm_opts into nfs_alloc_parsed_mount_data. Also, add a destructor for nfs_parsed_mount_data to make it easier to free all of the allocations hanging off of it, and to ensure that security_free_mnt_opts is called whenever security_init_mnt_opts is. I believe this regression was introduced quite some time ago, probably by commit c02d7adf. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
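The constructor/destructor pairing can be sketched with simplified userspace types. All names here are hypothetical stand-ins for `nfs_alloc_parsed_mount_data`, `security_init_mnt_opts` and `security_free_mnt_opts`. The LSM options are initialized exactly once, in the allocator and before parsing, so no later step can clobber them; the destructor frees everything hanging off the struct.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the LSM mount options. */
struct lsm_opts {
	char **strings;
	int num;
};

struct parsed_mount_data {
	struct lsm_opts lsm_opts;
	/* ... other parsed options ... */
};

static void lsm_init_opts(struct lsm_opts *o)
{
	o->strings = NULL;
	o->num = 0;
}

static void lsm_free_opts(struct lsm_opts *o)
{
	for (int i = 0; i < o->num; i++)
		free(o->strings[i]);
	free(o->strings);
	lsm_init_opts(o);
}

/* Called by the option parser, e.g. for "context=...". */
static int lsm_add_opt(struct lsm_opts *o, const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	char **grown;

	if (!copy)
		return -1;
	memcpy(copy, s, len);
	grown = realloc(o->strings, ((size_t)o->num + 1) * sizeof(*grown));
	if (!grown) {
		free(copy);
		return -1;
	}
	grown[o->num++] = copy;
	o->strings = grown;
	return 0;
}

/* Constructor: LSM options are initialized here, once, before parsing. */
static struct parsed_mount_data *alloc_parsed_mount_data(void)
{
	struct parsed_mount_data *d = malloc(sizeof(*d));

	if (d)
		lsm_init_opts(&d->lsm_opts);
	return d;
}

/* Destructor: every init is paired with a free. */
static void free_parsed_mount_data(struct parsed_mount_data *d)
{
	if (!d)
		return;
	lsm_free_opts(&d->lsm_opts);
	free(d);
}
```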
2012-01-05 | NFS - fix recent breakage to NFS error handling. | NeilBrown
From c6d615d2b97fe305cbf123a8751ced859dca1d5e Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 16 Nov 2011 09:39:05 +1100
Subject: [PATCH] NFS - fix recent breakage to NFS error handling.
commit 02c24a82187d5a628c68edfe71ae60dc135cd178 made a small and presumably unintended change to write error handling in NFS. Previously an error from filemap_write_and_wait_range would only be of interest if nfs_file_fsync did not return an error. After this commit, an error from filemap_write_and_wait_range would mean that (the rest of) nfs_file_fsync would not even be called. This means that:
 1/ you are more likely to see EIO than e.g. EDQUOT or ENOSPC.
 2/ NFS_CONTEXT_ERROR_WRITE remains set for longer so more writes are synchronous.
This patch restores previous behaviour. Cc: stable@kernel.org Cc: Josef Bacik <josef@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
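The restored precedence rule fits in a few lines. This hypothetical helper only models which error is reported once both steps have run. The NFS-specific fsync result wins, and the writeback error (from `filemap_write_and_wait_range()`) surfaces only when the NFS step itself succeeded, so a specific error such as EDQUOT is not masked by a generic EIO.

```c
#include <errno.h>

/* Hypothetical model of the error precedence in nfs_file_fsync(). */
static int fsync_status(int writeback_err, int nfs_fsync_err)
{
	int status = nfs_fsync_err;

	if (status == 0)
		status = writeback_err;
	return status;
}
```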
2012-01-05 | SUNRPC: Clean up the RPCSEC_GSS service ticket requests | Trond Myklebust
Instead of hacking specific service names into gss_encode_v1_msg, we should just allow the caller to specify the service name explicitly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: J. Bruce Fields <bfields@redhat.com>
2012-01-05 | Btrfs: make sure we're not using obsolete code in btrfs_get_extent | Jan Schmidt
There's code in btrfs_get_extent that should never be used. This patch turns a WARN_ON(1) into a BUG(), hoping we can remove the transaction code from btrfs_get_extent soon. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-05 | Btrfs: new backref walking code | Jan Schmidt
The old backref iteration code could only safely be used on commit roots. Besides this limitation, it had bugs in finding the roots for these references. This commit replaces large parts of it by btrfs_find_all_roots() which a) really finds all roots and the correct roots, b) works correctly under heavy file system load, c) considers delayed refs. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 | ext4: make more symbols static | Eric Sandeen
A couple more functions can reasonably be made static if desired. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | ext4: make local symbol ext4_initxattrs static | Djalal Harouni
The ext4_initxattrs symbol is used only in this file, so it should be declared static. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | jbd2: fix hung processes in jbd2_journal_lock_updates() | Jan Kara
Toshiyuki Okajima found out that when running
    for ((i=0; i < 100000; i++)); do
        if ((i%2 == 0)); then
            chattr +j /mnt/file
        else
            chattr -j /mnt/file
        fi
        echo "0" >> /mnt/file
    done
a process sometimes hangs indefinitely in jbd2_journal_lock_updates(). Toshiyuki identified that the following race happens:
    jbd2_journal_lock_updates()            |jbd2_journal_stop()
    ---------------------------------------+---------------------------------------
    write_lock(&journal->j_state_lock)     |    .
    ++journal->j_barrier_count             |    .
    spin_lock(&tran->t_handle_lock)        |    .
    atomic_read(&tran->t_updates) //not 0  |
                                           |atomic_dec_and_test(&tran->t_updates)
                                           |   // t_updates = 0
                                           |wake_up(&journal->j_wait_updates)
    prepare_to_wait()                      |   // no process is woken up.
    spin_unlock(&tran->t_handle_lock)      |
    write_unlock(&journal->j_state_lock)   |
    schedule() // never return             |
We fix the problem by first calling prepare_to_wait() and only after that checking t_updates in jbd2_journal_lock_updates(). Reported-and-analyzed-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
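The race above is a classic lost wakeup. The jbd2 fix orders `prepare_to_wait()` before the `t_updates` check; with POSIX threads the same guarantee comes from re-checking the counter under the mutex that the waker must also take. This is a userspace analog with hypothetical names, not the jbd2 code.

```c
#include <pthread.h>

/* Counter of in-flight updates, protected by `lock`. */
struct updates {
	pthread_mutex_t lock;
	pthread_cond_t zero;
	int count;
};

/* Waker side (like jbd2_journal_stop()): decrement and signal under the lock. */
static void updates_stop(struct updates *u)
{
	pthread_mutex_lock(&u->lock);
	if (--u->count == 0)
		pthread_cond_broadcast(&u->zero);
	pthread_mutex_unlock(&u->lock);
}

/* Waiter side: the check and the sleep are atomic with respect to the
 * waker, so the decrement-and-signal cannot slip in between them. */
static void updates_wait_zero(struct updates *u)
{
	pthread_mutex_lock(&u->lock);
	while (u->count != 0)
		pthread_cond_wait(&u->zero, &u->lock);
	pthread_mutex_unlock(&u->lock);
}

static void *stopper(void *arg)
{
	updates_stop(arg);
	return NULL;
}
```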
2012-01-04 | ext4: reserve new feature flag codepoints | Theodore Ts'o
Reserve the ext4 features flags EXT4_FEATURE_RO_COMPAT_METADATA_CSUM, EXT4_FEATURE_INCOMPAT_INLINEDATA, and EXT4_FEATURE_INCOMPAT_LARGEDIR. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller
2012-01-04 | ext4: Report max_batch_time option correctly | Ben Hutchings
Currently the value reported for max_batch_time is really the value of min_batch_time. Reported-by: Russell Coker <russell@coker.com.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-04 | minixfs: misplaced checks lead to dentry leak | Al Viro
bitmap size sanity checks should be done *before* allocating ->s_root; there their cleanup on failure would be correct. As it is, we do iput() on root inode, but leak the root dentry... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-04 | ext4: add missing ext4_resize_end on error paths | Djalal Harouni
Online resize ioctls 'EXT4_IOC_GROUP_EXTEND' and 'EXT4_IOC_GROUP_ADD' call ext4_resize_begin() to check permissions and to set the EXT4_RESIZING bit lock, they do their work and they must finish with ext4_resize_end() which calls clear_bit_unlock() to unlock and to avoid -EBUSY errors for the next resize operations. This patch adds the missing ext4_resize_end() calls on error paths. Patch tested. Cc: stable@vger.kernel.org Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
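The fix follows the usual single-exit cleanup pattern. This sketch uses a C11 `atomic_flag` as a hypothetical stand-in for the EXT4_RESIZING bit; `group_add()` and its `fail_with` parameter only simulate an ioctl that may hit a validation error. Once the begin step has succeeded, every exit path goes through the matching end step, so an error no longer leaves the lock held and the next resize does not get -EBUSY.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model of the EXT4_RESIZING bit lock. */
static atomic_flag resizing = ATOMIC_FLAG_INIT;

static int resize_begin(void)
{
	return atomic_flag_test_and_set(&resizing) ? -EBUSY : 0;
}

static void resize_end(void)
{
	atomic_flag_clear(&resizing);
}

/* Every exit after a successful resize_begin() goes through resize_end(). */
static int group_add(int fail_with)
{
	int err = resize_begin();

	if (err)
		return err;
	if (fail_with) { /* stands in for any validation error */
		err = fail_with;
		goto out;
	}
	/* ... the actual resize work would happen here ... */
out:
	resize_end();
	return err;
}
```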
2012-01-04 | ext4: let ext4_group_add() use common code | Yongqiang Yang
This patch lets ext4_group_add() call ext4_flex_group_add(). Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | ext4: let ext4_group_extend() use common code | Yongqiang Yang
ext4_group_extend_no_check() is moved out of ext4_group_extend(); this patch lets ext4_group_extend() call ext4_group_extend_no_check() instead. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | ext4: add new online resize interface | Yongqiang Yang
This patch adds a new online resize interface, whose input argument is a 64-bit integer indicating how many blocks there are in the resized fs. In the new resize implementation, all work such as allocating group tables is done on the kernel side, so the new resize interface can support the flex_bg feature and prepares the ground for supporting resize with features like bigalloc and exclude bitmap. Besides this, user-space tools just pass in the new number of blocks. We delay initializing the bitmaps and inode tables of added groups if possible and add multiple groups (a flex group) each time, so the new resize is very fast, like mkfs. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 | Btrfs: added btrfs_find_all_roots() | Jan Schmidt
This function gets a byte number (a data extent), collects all the leafs pointing to it and walks up the trees to find all fs roots pointing to those leafs. It also returns the list of all leafs pointing to that extent. It does proper locking for the involved trees, can be used on busy file systems and honors delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 | Btrfs: add waitqueue instead of doing busy waiting for more delayed refs | Jan Schmidt
Now that we may be holding back delayed refs for a limited period, we might end up having no runnable delayed refs. Without this commit, we'd do busy waiting in that thread until another (runnable) ref arrives. Instead, we detect this situation and use a waitqueue, so that we only try to run more refs after a) another runnable ref was added or b) delayed refs are no longer held back. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 | Btrfs: put back delayed refs that are too new | Arne Jansen
When processing a delayed ref, first check if there are still old refs in the process of being added. If so, put this ref back to the tree. To avoid looping on this ref, choose a newer one in the next loop. btrfs_find_ref_cluster has to take care of that. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 | Btrfs: add sequence numbers to delayed refs | Arne Jansen
Sequence numbers are needed to reconstruct the backrefs of a given extent to a certain point in time. The total set of backrefs consists of the set of backrefs recorded on disk plus the enqueued delayed refs for it that existed at that moment. This patch also adds a list that records all delayed refs which are currently in the process of being added. When walking all refs of an extent in btrfs_find_all_roots(), we freeze the current state of delayed refs, honor anything up to this point and prevent processing newer delayed refs to assert consistency. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
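The point-in-time reconstruction can be illustrated with a hypothetical, much-simplified model: each delayed ref carries a sequence number and a +1/-1 delta, and the backref count "as of" a snapshot is the on-disk count plus only those delayed refs queued at or before the snapshot's sequence number. The struct and function are illustrative assumptions, not btrfs data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical delayed ref: +1 adds a backref, -1 drops one. */
struct delayed_ref {
	uint64_t seq;
	int delta;
};

/* Backref count as of snapshot `seq`: on-disk count plus the delayed
 * refs queued at or before the snapshot; newer refs are ignored. */
static int backrefs_at(int on_disk, const struct delayed_ref *refs, size_t n,
		       uint64_t seq)
{
	int total = on_disk;

	for (size_t i = 0; i < n; i++)
		if (refs[i].seq <= seq)
			total += refs[i].delta;
	return total;
}
```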
2012-01-04 | Btrfs: add nested locking mode for paths | Arne Jansen
This patch adds the possibility to read-lock an extent even if it is already write-locked from the same thread. btrfs_find_all_roots() needs this capability. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 | dlm: add recovery callbacks | David Teigland
These new callbacks notify the dlm user about lock recovery. GFS2, and possibly others, need to be aware of when the dlm will be doing lock recovery for a failed lockspace member. In the past, this coordination has been done between dlm and file system daemons in userspace, which then direct their kernel counterparts. These callbacks allow the same coordination directly, and more simply. Signed-off-by: David Teigland <teigland@redhat.com>
2012-01-04 | dlm: add node slots and generation | David Teigland
Slot numbers are assigned to nodes when they join the lockspace. The slot number chosen is the minimum unused value starting at 1. Once a node is assigned a slot, that slot number will not change while the node remains a lockspace member. If the node leaves and rejoins it can be assigned a new slot number. A new generation number is also added to a lockspace. It is set and incremented during each recovery along with the slot collection/assignment. The slot numbers will be passed to gfs2 which will use them as journal id's. Signed-off-by: David Teigland <teigland@redhat.com>
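The slot-assignment rule, "the minimum unused value starting at 1", can be sketched as follows. The function is a hypothetical illustration, not the dlm implementation. Because at most `n` slots are taken, the search always terminates by slot `n + 1`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Smallest slot number >= 1 that does not appear in used[0..n). */
static int min_unused_slot(const int *used, size_t n)
{
	for (int slot = 1;; slot++) {
		bool taken = false;

		for (size_t i = 0; i < n; i++) {
			if (used[i] == slot) {
				taken = true;
				break;
			}
		}
		if (!taken)
			return slot;
	}
}
```

A joining node therefore fills the lowest gap left by a departed member: with slots {1, 2, 4} in use it gets 3.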
2012-01-04 | dlm: move recovery barrier calls | David Teigland
Put all the calls to recovery barriers in the same function to clarify where they each happen. Should not change any behavior. Also modify some recovery debug lines to make them consistent. Signed-off-by: David Teigland <teigland@redhat.com>