summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2008-11-06jbd2: Call journal commit callback without holding j_list_lockAneesh Kumar K.V
Avoid freeing the transaction in __jbd2_journal_drop_transaction() so the journal commit callback can run without holding j_list_lock, to avoid lock contention on this spinlock. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-25ext4: cleanup mballoc header filesAneesh Kumar K.V
Move some of the forward declaration of the static functions to mballoc.c where they are used. This enables us to include mballoc.h in other .c files. Also correct the buddy cache documentation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resizeAneesh Kumar K.V
The new groups added during resize are flagged as need_init group. Make sure we properly initialize these groups. When we have block size < page size and we are adding new groups the page may still be marked uptodate even though we haven't initialized the group. While forcing the init of buddy cache we need to make sure other groups part of the same page of buddy cache is not using the cache. group_info->alloc_sem is added to ensure the same. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org
2009-01-05ext4: Add blocks added during resize to bitmapAneesh Kumar K.V
With this change new blocks added during resize are marked as free in the block bitmap and the group is flagged with EXT4_GROUP_INFO_NEED_INIT_BIT flag. This makes sure when mballoc tries to allocate blocks from the new group we would reload the buddy information using the bitmap present in the disk. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2008-11-22ext4: sparse fixesAneesh Kumar K.V
* Change EXT4_HAS_*_FEATURE to return a boolean * Add a function prototype for ext4_fiemap() in ext4.h * Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions * Add lock annotations to mb_free_blocks() Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05jbd2: Remove a large array of bh's from the stack of the checkpoint routineTheodore Ts'o
jbd2_log_do_checkpoint()n is one of the kernel's largest stack users. Move the array of buffer head's from the stack of jbd2_log_do_checkpoint() to the in-core journal structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05ext4: Change unsigned long to unsigned intTheodore Ts'o
Convert the unsigned longs that are most responsible for bloating the stack usage on 64-bit systems. Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05ext4: Make ext4_group_t be an unsigned intTheodore Ts'o
Nearly all places in the ext3/4 code which uses "unsigned long" is probably a bug, since on 32-bit systems a ulong a 32-bits, which means we are wasting stack space on 64-bit systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-04ext4: Remove i_ext_generation from ext4_inode_info structureTheodore Ts'o
The i_ext_generation was incremented, but never used. Remove it to slim down the ext4_inode_info structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03ext4: add fsync batch tuning knobsTheodore Ts'o
Add new mount options, min_batch_time and max_batch_time, which controls how long the jbd2 layer should wait for additional filesystem operations to get batched with a synchronous write transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-17ext4: display average commit timeTheodore Ts'o
Display the average commit time (which is used by the ext4 fsync batching patch) in /proc/fs/jbd2/*/info for performance tuning purposes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26jbd2: improve jbd2 fsync batchingJosef Bacik
This patch removes the static sleep time in favor of a more self optimizing approach where we measure the average amount of time it takes to commit a transaction to disk and the ammount of time a transaction has been running. If somebody does a sync write or an fsync() traditionally we would sleep for 1 jiffies, which depending on the value of HZ could be a significant amount of time compared to how long it takes to commit a transaction to the underlying storage. With this patch instead of sleeping for a jiffie, we check to see if the amount of time this transaction has been running is less than the average commit time, and if it is we sleep for the delta using schedule_hrtimeout to give us a higher precision sleep time. This greatly benefits high end storage where you could end up sleeping for longer than it takes to commit the transaction and therefore sitting idle instead of allowing the transaction to be committed by keeping the sleep time to a minimum so you are sure to always be doing something. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05ext4: Don't overwrite allocation_context ac_statusAneesh Kumar K.V
We can call ext4_mb_check_limits even after successfully allocating the requested blocks. In that case, make sure we don't overwrite ac_status if it already has the status AC_STATUS_FOUND. This fixes the lockdep warning: ============================================= [ INFO: possible recursive locking detected ] 2.6.28-rc6-autokern1 #1 --------------------------------------------- fsstress/11948 is trying to acquire lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 ..... stack backtrace: ..... [<c04db974>] ext4_mb_regular_allocator+0xbb5/0xd44 ..... but task is already holding lock: (&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2009-01-05ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning()Theodore Ts'o
This removes annoying blank syslog entries emitted by ext4_error() or ext4_warning(), since these functions add their own newline. Signed-off-by: Nick Warne <nick@ukfsn.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05jbd2: Add barrier not supported test to journal_wait_on_commit_recordTheodore Ts'o
Xen doesn't report that barriers are not supported until buffer I/O is reported as completed, instead of when the buffer I/O is submitted. Add a check and a fallback codepath to journal_wait_on_commit_record() to detect this case, so that attempts to mount ext4 filesystems on LVM/devicemapper devices on Xen guests don't blow up with an "Aborting journal on device XXX"; "Remounting filesystem read-only" error. Thanks to Andreas Sundstrom for reporting this issue. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2009-01-07ext4: Allow ext4 to run without a journalFrank Mayhar
A few weeks ago I posted a patch for discussion that allowed ext4 to run without a journal. Since that time I've integrated the excellent comments from Andreas and fixed several serious bugs. We're currently running with this patch and generating some performance numbers against both ext2 (with backported reservations code) and ext4 with and without a journal. It just so happens that running without a journal is slightly faster for most everything. We did iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2 which creates 4 threads, each of which create and do reads and writes on a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens to bypass the page cache. Results: ext2 ext4, default ext4, no journal initial writes 13.0 MB/s 15.4 MB/s 15.7 MB/s rewrites 13.1 MB/s 15.6 MB/s 15.9 MB/s reads 15.2 MB/s 16.9 MB/s 17.2 MB/s re-reads 15.3 MB/s 16.9 MB/s 17.2 MB/s random readers 5.6 MB/s 5.6 MB/s 5.7 MB/s random writers 5.1 MB/s 5.3 MB/s 5.4 MB/s So it seems that, so far, this was a useful exercise. Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-17ext4: Widen type of ext4_sb_info.s_mb_maxs[]Yasunori Goto
I chased the cause of following ext4 oops report which is tested on ia64 box. http://bugzilla.kernel.org/show_bug.cgi?id=12018 The cause is the size of s_mb_maxs array that is defined as "unsigned short" in ext4_sb_info structure. If the file system's block size is 8k or greater, an unsigned short is not wide enough to contain the value fs->blocksize << 3. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: stable@kernel.org
2008-11-26ext4: When resizing set the EXT4_BG_INODE_ZEROED flag for new block groupsSolofo.Ramangalahy@bull.net
The inode table has been zeroed in setup_new_group_blocks(). Mark it as such in ext4_group_add(). Since we are currently clearing inode table for the new block group, we should set the EXT4_BG_INODE_ZEROED flag. If at some point in the future we don't immediately zero out the inode table as part of the resize operation, then obviously we shouldn't do this. Signed-off-by: Solofo.Ramangalahy@bull.net Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26ext4: Use simple_strtol() instead of simple_strtoul() in ext4_ui_proc_openRoel Kluin
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-25ext4: fix build warningWu Fengguang
Replace `if' with `goto' to assure gcc that ix has been initialized. Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
2009-01-05ext4: avoid ext4_error when mounting a fs with a single bgAneesh Kumar K.V
Remove some completely unneeded code which which caused an ext4_error to be generated when mounting a file system with only a single block group. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2009-01-05ext4: Fix the delalloc writepages to allocate blocks at the right offset.Aneesh Kumar K.V
When iterating through the pages which have mapped buffer_heads, we failed to update the b_state value. This results in allocating blocks at logical offset 0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2008-11-05ext4: tone down ext4_da_writepages warningsTheodore Ts'o
If the filesystem has errors, ext4_da_writepages() will return a *lot* of errors, including lots and lots of stack dumps. While it's true that we are dropping user data on the floor, which is unfortunate, the stack dumps aren't helpful, and they tend to obscure the true original root cause of the problem. So in the case where the filesystem has aborted, return an EROFS right away. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-12ext4: remove do_blk_alloc()Theodore Ts'o
The convenience function do_blk_alloc() is a static function with only one caller, so fold it into ext4_new_meta_blocks() to simplify the code and to make it easier to understand. To save more stack space, if count is a null pointer in ext4_new_meta_blocks() assume that caller wanted a single block (and if there is an error, no blocks were allocated). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-07ext4: remove ext4_new_meta_block()Theodore Ts'o
There were only two one callers of the function ext4_new_meta_block(), which just a very simpler wrapper function around ext4_new_meta_blocks(). Change those two functions to call ext4_new_meta_blocks() directly, to save code and stack space usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-01ext4: remove ext4_new_blocks() and call ext4_mb_new_blocks() directlyTheodore Ts'o
There was only one caller of the compatibility function ext4_new_blocks(), in balloc.c's ext4_alloc_blocks(). Change it to call ext4_mb_new_blocks() directly, and remove ext4_new_blocks() altogether. This cleans up the code, by removing two extra functions from the call chain, and hopefully saving some stack usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-06ext3/4: Fix loop index in do_split() so it is signedTheodore Ts'o
This fixes a gcc warning but it doesn't appear able to result in a failure, since the primary way the loop is exited is the first conditional in the for loop, and at least for a consistent filesystem, the signed/unsigned should in practice never be exposed. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-28ext4: Add support for non-native signed/unsigned htree hash algorithmsTheodore Ts'o
The original ext3 hash algorithms assumed that variables of type char were signed, as God and K&R intended. Unfortunately, this assumption is not true on some architectures. Userspace support for marking filesystems with non-native signed/unsigned chars was added two years ago, but the kernel-side support was never added (until now). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-28ext3: Add support for non-native signed/unsigned htree hash algorithmsTheodore Ts'o
The original ext3 hash algorithms assumed that variables of type char were signed, as God and K&R intended. Unfortunately, this assumption is not true on some architectures. Userspace support for marking filesystems with non-native signed/unsigned chars was added two years ago, but the kernel-side support was never added (until now). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org
2008-10-29ext4: fix printk format warningAlexander Beregalov
fs/ext4/balloc.c:607: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1822: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1824: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-01-04Merge branch 'audit.b61' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b61' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: audit: validate comparison operations, store them in sane form clean up audit_rule_{add,del} a bit make sure that filterkey of task,always rules is reported audit rules ordering, part 2 fixing audit rule ordering mess, part 1 audit_update_lsm_rules() misses the audit_inode_hash[] ones sanitize audit_log_capset() sanitize audit_fd_pair() sanitize audit_mq_open() sanitize AUDIT_MQ_SENDRECV sanitize audit_mq_notify() sanitize audit_mq_getsetattr() sanitize audit_ipc_set_perm() sanitize audit_ipc_obj() sanitize audit_socketcall don't reallocate buffer in every audit_sockaddr()
2009-01-04fs: symlink write_begin allocation context fixNick Piggin
With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-04fs: introduce bgl_lock_ptr()Pekka Enberg
As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in <linux/blockgroup_lock.h> and add separate sb_bgl_lock() helpers to filesystem specific header files to break the hidden dependency to struct ext[234]_sb_info. Also, while at it, convert the macros to static inlines to try make up for all the times I broke Andrew Morton's tree. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-04sanitize audit_fd_pair()Al Viro
* no allocations * return void Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-03Merge branch 'cpus4096-for-linus-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits) x86: setup_per_cpu_areas() cleanup cpumask: fix compile error when CONFIG_NR_CPUS is not defined cpumask: use alloc_cpumask_var_node where appropriate cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t x86: use cpumask_var_t in acpi/boot.c x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids sched: put back some stack hog changes that were undone in kernel/sched.c x86: enable cpus display of kernel_max and offlined cpus ia64: cpumask fix for is_affinity_mask_valid() cpumask: convert RCU implementations, fix xtensa: define __fls mn10300: define __fls m32r: define __fls h8300: define __fls frv: define __fls cris: define __fls cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS cpumask: zero extra bits in alloc_cpumask_var_node cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/ cpumask: convert mm/ ...
2009-01-03get rid of special-casing the /sbin/loader on alphaAl Viro
... just make it a binfmt handler like #! one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-03sanitize ifdefs in binfmt_aoutAl Viro
They are actually alpha vs. i386/arm/m68k i.e. ecoff vs. aout. In the only place where we actually tried to handle arm and i386/m68k in different ways (START_DATA() in coredump handling), the arm variant works for all of them (i386 and m68k have u.start_code set to 0). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-03remove the rudiment of a.out for sparcAl Viro
it's been used only in sunos compat Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6Linus Torvalds
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (33 commits) UBIFS: add more useful debugging prints UBIFS: print debugging messages properly UBIFS: fix numerous spelling mistakes UBIFS: allow mounting when short of space UBIFS: fix writing uncompressed files UBIFS: fix checkpatch.pl warnings UBIFS: fix sparse warnings UBIFS: simplify make_free_space UBIFS: do not lie about used blocks UBIFS: restore budg_uncommitted_idx UBIFS: always commit on unmount UBIFS: use ubi_sync UBIFS: always commit in sync_fs UBIFS: fix file-system synchronization UBIFS: fix constants initialization UBIFS: avoid unnecessary calculations UBIFS: re-calculate min_idx_size after the commit UBIFS: use nicer 64-bit math UBIFS: fix available blocks count UBIFS: various comment improvements and fixes ...
2009-01-02Merge branch 'kvm-updates/2.6.29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (140 commits) KVM: MMU: handle large host sptes on invlpg/resync KVM: Add locking to virtual i8259 interrupt controller KVM: MMU: Don't treat a global pte as such if cr4.pge is cleared MAINTAINERS: Maintainership changes for kvm/ia64 KVM: ia64: Fix kvm_arch_vcpu_ioctl_[gs]et_regs() KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip KVM: fix handling of ACK from shared guest IRQ KVM: MMU: check for present pdptr shadow page in walk_shadow KVM: Consolidate userspace memory capability reporting into common code KVM: Advertise the bug in memory region destruction as fixed KVM: use cpumask_var_t for cpus_hardware_enabled KVM: use modern cpumask primitives, no cpumask_t on stack KVM: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus KVM: set owner of cpu and vm file operations anon_inodes: use fops->owner for module refcount x86: KVM guest: kvm_get_tsc_khz: return khz, not lpj KVM: MMU: prepopulate the shadow on invlpg KVM: MMU: skip global pgtables on sync due to cr3 switch KVM: MMU: collapse remote TLB flushes on root sync ...
2009-01-02CRED: Wrap task credential accesses in the devpts filesystemDavid Howells
Wrap access to task credentials so that they can be separated more easily from the task_struct during the introduction of COW creds. Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id(). Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more sense to use RCU directly rather than a convenient wrapper; these will be addressed by later patches. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02devpts: fix unused function warningAndrew Morton
fs/devpts/inode.c:324: warning: 'compare_init_pts_sb' defined but not used Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02devpts: Coding style clean upAlan Cox
Just nail the oddments now while this code is being touched Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Enable multiple instances of devptsSukadev Bhattiprolu
To support containers, allow multiple instances of devpts filesystem, such that indices of ptys allocated in one instance are independent of ptys allocated in other instances of devpts. But to preserve backward compatibility, enable this support for multiple instances only if: - CONFIG_DEVPTS_MULTIPLE_INSTANCES is set to Y, and - '-o newinstance' mount option is specified while mounting devpts To use multi-instance mount, a container startup script could: $ ns_exec -cm /bin/bash $ umount /dev/pts $ mount -t devpts -o newinstance lxcpts /dev/pts $ mount -o bind /dev/pts/ptmx /dev/ptmx $ /usr/sbin/sshd -p 1234 where 'ns_exec -cm /bin/bash' is calls clone() with CLONE_NEWNS flag and execs /bin/bash in the child process. A pty created by the sshd is not visible in the original mount of /dev/pts. USER-SPACE-IMPACT: - See Documentation/fs/devpts.txt (included in next patch) for user- space impact in multi-instance and mixed-mode operation. TODO: - Update mount(8), pts(4) man pages. Highlight impact of not redirecting /dev/ptmx to /dev/pts/ptmx after a multi-instance mount. Changelog[v6]: - [Dave Hansen] Use new get_init_pts_sb() interface - [Serge Hallyn] Don't bother displaying 'newinstance' in show_options - [Serge Hallyn] Use macros (PARSE_REMOUNT/PARSE_MOUNT) instead of 0/1. - [Serge Hallyn] Check error return from get_sb_single() (now get_init_pts_sb()) - devpts_pty_kill(): don't dput error dentries Changelog[v5]: - Move get_sb_ref() definition to earlier patch - Move usage info to Documentation/filesystems/devpts.txt (next patch) - Make ptmx node even in init_pts_ns, now that default mode is 0000 (defined in earlier patch, enabled here). - Cache ptmx dentry and use to update mode during remount (defined in earlier patch, enabled here). - Bugfix: explicitly ignore newinstance on remount (if newinstance was specified on remount of initial mount, it would be ignored but /proc/mounts would imply that the option was set) Changelog[v4]: - Update patch description to address H. Peter Anvin's comments - Consolidate multi-instance mode code under new config token, CONFIG_DEVPTS_MULTIPLE_INSTANCE. - Move usage-details from patch description to Documentation/fs/devpts.txt Changelog[v3]: - Rename new mount option to 'newinstance' - Create ptmx nodes only in 'newinstance' mounts - Bugfix: parse_mount_options() modifies @data but since we need to parse the @data twice (once in devpts_get_sb() and once during do_remount_sb()), parse a local copy of @data in devpts_get_sb(). (restructured code in devpts_get_sb() to fix this) Changelog[v2]: - Support both single-mount and multiple-mount semantics and provide '-onewmnt' option to select the semantics. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Define get_init_pts_sb()Sukadev Bhattiprolu
See comments in the function header for details. The new interface will be used in a follow-on patch. Changelog [v2]: [Dave Hansen] Replace get_sb_ref() in fs/super.c with get_init_pts_sb() and make the new interface private to devpts Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Define mknod_ptmx()Sukadev Bhattiprolu
/dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx, allocates the next pty index and the associated device shows up in the devpts fs as /dev/pts/n. Wih multiple instancs of devpts filesystem, during an open of /dev/ptmx we would be unable to determine which instance of the devpts is being accessed. So we move the 'ptmx' node into /dev/pts and use the inode of the 'ptmx' node to identify the superblock and hence the devpts instance. This patch adds ability for the kernel to internally create the [ptmx, c, 5:2] device when mounting devpts filesystem. Since the ptmx node in devpts is new and may surprise some userspace scripts, the default permissions for the new node is 0000. These permissions can be changed either using chmod or by remounting with the new '-o ptmxmode=0666' mount option. Changelog[v5]: - [Serge Hallyn bugfix]: Letting new_inode() assign inode number to ptmx can collide with hand-assigning inode numbers to ptys. So, hand-assign specific inode number to ptmx node also. - [Serge Hallyn]: Maybe safer to grab root dentry mutex while creating ptmx node - [Bugfix with Serge Hallyn] Replace lookup_one_len() in mknod_ptmx() wih d_alloc_name() (lookup during ->get_sb() locks up system). To simplify patchset, fold the ptmx_dentry patch into this. Changelog[v4]: - Change default permissions of pts/ptmx node to 0000. - Move code for ptmxmode under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES. Changelog[v3]: - Rename ptmx_mode to ptmxmode (for consistency with 'newinstance') Changelog[v2]: - [H. Peter Anvin] Remove mknod() system call support and create the ptmx node internally. Changelog[v1]: - Earlier version of this patch enabled creating /dev/pts/tty as well. As pointed out by Al Viro and H. Peter Anvin, that is not really necessary. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Extract option parsing to new functionSukadev Bhattiprolu
Move code to parse mount options into a separate function so it can (later) be shared between mount and remount operations. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Per-mount 'config' objectSukadev Bhattiprolu
With support for multiple mounts of devpts, the 'config' structure really represents per-mount options rather than config parameters. Rename 'config' structure to 'pts_mount_opts' and store it in the super-block. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Per-mount allocated_ptysSukadev Bhattiprolu
To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount variable rather than a global variable. Move 'allocated_ptys' into the super_block's s_fs_info. Changelog[v2]: Define and use DEVPTS_SB() wrapper. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-02Remove devpts_root globalSukadev Bhattiprolu
Remove the 'devpts_root' global variable and find the root dentry using the super_block. The super-block can be found from the device inode, using the new wrapper, pts_sb_from_inode(). Changelog: This patch is based on an earlier patchset from Serge Hallyn and Matt Helsley. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>