summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2007-10-17Implement file posix capabilitiesSerge E. Hallyn
Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fsccaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17r/o bind mounts: create cleanup helper svc_msnfs()Dave Hansen
I'm going to be modifying nfsd_rename() shortly to support read-only bind mounts. This #ifdef is around the area I'm patching, and it starts to get really ugly if I just try to add my new code by itself. Using this little helper makes things a lot cleaner to use. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17r/o bind mounts: give permission() a local 'mnt' variableDave Hansen
First of all, this makes the structure jumping look a little bit cleaner. So, this stands alone as a tiny cleanup. But, we also need 'mnt' by itself a few more times later in this series, so this isn't _just_ a cleanup. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17r/o bind mounts: rearrange may_open() to be r/o friendlyDave Hansen
may_open() calls vfs_permission() before it does checks for IS_RDONLY(inode). It checks _again_ inside of vfs_permission(). The check inside of vfs_permission() is going away eventually. With the mnt_want/drop_write() functions, all of the r/o checks (except for this one) are consistently done before calling permission(). Because of this, I'd like to use permission() to hold a debugging check to make sure that the mnt_want/drop_write() calls are actually being made. So, to do this: 1. remove the IS_RDONLY() check from permission() 2. enforce that you must mnt_want_write() before even calling permission() 3. actually add the debugging check to permission() We need to rearrange may_open() to do r/o checks before calling permission(). Here's the patch. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17r/o bind mounts: filesystem helpers for custom 'struct file'sDave Hansen
Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. This patch: Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: clean up execute permission checkingMiklos Szeredi
Define a new function fuse_refresh_attributes() that conditionally refreshes the attributes based on the validity timeout. In fuse_permission() only refresh the attributes for checking the execute bits if necessary. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: no ENOENT from fuse device readMiklos Szeredi
Don't return -ENOENT for a read() on the fuse device when the request was aborted. Instead return -ENODEV, meaning the filesystem has been force-umounted or aborted. Previously ENOENT meant that the request was interrupted, but now the 'aborted' flag is not set in case of interrupts. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: no abort on interruptMiklos Szeredi
Don't set 'aborted' flag on a request if it's interrupted. We have to wait for the answer anyway, and this would only a very little time while copying the reply. This means, that write() on the fuse device will not return -ENOENT during normal operation, only if the filesystem is aborted by a forced umount or through the fusectl interface. This could simplify userspace code somewhat when backward compatibility with earlier kernel versions is not required. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: cleanup in releaseMiklos Szeredi
Move dput/mntput pair from request_end() to fuse_release_end(), because there's no other place they are used. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: fix permission checking on sticky directoriesMiklos Szeredi
The VFS checks sticky bits on the parent directory even if the filesystem defines it's own ->permission(). In some situations (sshfs, mountlo, etc) the user does have permission to delete a file even if the attribute based checking would not allow it. So work around this by storing the permission bits separately and returning them in stat(), but cutting the permission bits off from inode->i_mode. This is slightly hackish, but it's probably not worth it to add new infrastructure in VFS and a slight performance penalty for all filesystems, just for the sake of fuse. [Jan Engelhardt] cosmetic fixes Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: refresh stale attributes in fuse_permission()Miklos Szeredi
fuse_permission() didn't refresh inode attributes before using them, even if the validity has already expired. Thanks to Junjiro Okajima for spotting this. Also remove some old code to unconditionally refresh the attributes on the root inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: set i_nlink to sane value after mountMiklos Szeredi
Aufs seems to depend on a positive i_nlink value. So fill in a dummy but sane value for the root inode at mount time. The inode attributes are refreshed with the correct values at the first opportunity. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: fix page invalidationMiklos Szeredi
Other than truncate, there are two cases, when fuse tries to get rid of cached pages: a) in open, if KEEP_CACHE flag is not set b) in getattr, if file size changed spontaneously Until now invalidate_mapping_pages() were used, which didn't get rid of mapped pages. This is wrong, and becomes more wrong as dirty pages are introduced. So instead properly invalidate all pages with invalidate_inode_pages2(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: truncate on spontaneous size changeMiklos Szeredi
Memory mappings were only truncated on an explicit truncate, but not when the file size was changed externally. Fix this by moving the truncation code from fuse_setattr to fuse_change_attributes. Yes, there are races between write and and external truncation, but we can't really do anything about them. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: add reference counting to fuse_fileMiklos Szeredi
Make lifetime of 'struct fuse_file' independent from 'struct file' by adding a reference counter and destructor. This will enable asynchronous page writeback, where it cannot be guaranteed, that the file is not released while a request with this file handle is being served. The actual RELEASE request is only sent when there are no more references to the fuse_file. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: fix reserved request wake upMiklos Szeredi
Use wake_up_all instead of wake_up in put_reserved_req(), otherwise it is possible that the right task is not woken up. Also create a separate reserved_req_waitq in addition to the blocked_waitq, since they fulfill totally separate functions. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fuse: update backing_dev_info congestion stateMiklos Szeredi
Set the read and write congestion state if the request queue is close to blocking, and clear it when it's not. This prevents unnecessary blocking in readahead and (when writable mmaps are allowed) writeback. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17ext2 reservationsMartin J. Bligh
Val's cross-port of the ext3 reservations code into ext2. [mbligh@mbligh.org: Small type error for printk [akpm@linux-foundation.org: fix types, sync with ext3] [mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3] [akpm@linux-foundation.org: kill noisy printk] [akpm@linux-foundation.org: remember to dirty the gdp's block] [akpm@linux-foundation.org: cross-port the missed 5dea5176e5c32ef9f0d1a41d28427b3bf6881b3a] [akpm@linux-foundation.org: cross-port e6022603b9aa7d61d20b392e69edcdbbc1789969] [akpm@linux-foundation.org: Port the omitted 08fb306fe63d98eb86e3b16f4cc21816fa47f18e] [akpm@linux-foundation.org: Backport the missed 20acaa18d0c002fec180956f87adeb3f11f635a6] [akpm@linux-foundation.org: fixes] [cmm@us.ibm.com: fix reservation extension] [bunk@stusta.de: make ext2_get_blocks() static] [hugh@veritas.com: fix hang] [hugh@veritas.com: ext2_new_blocks should reset the reservation window size] [hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end] [hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such] [hugh@veritas.com: rbtree usage cleanup] [pbadari@us.ibm.com: Fix for ext2 reservation] [bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()] [hugh@veritas.com: ext2 balloc: use io_error label] Cc: "Martin J. Bligh" <mbligh@mbligh.org> Cc: Valerie Henson <val_henson@linux.intel.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17introduce I_SYNCJoern Engel
I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. [bunk@stusta.de: make fs/inode.c:wake_up_inode() static] Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: introduce writeback_control.more_io to indicate more ioFengguang Wu
After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate this situation. With it the big dirty file no longer has to wait for the next kupdate invocation 5s later. Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: remove pages_skipped accounting in __block_write_full_page()Fengguang Wu
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug: > The following strange behavior can be observed: > > 1. large file is written > 2. after 30 seconds, nr_dirty goes down by 1024 > 3. then for some time (< 30 sec) nothing happens (disk idle) > 4. then nr_dirty again goes down by 1024 > 5. repeat from 3. until whole file is written > > So basically a 4Mbyte chunk of the file is written every 30 seconds. > I'm quite sure this is not the intended behavior. It can be produced by the following test scheme: # cat bin/test-writeback.sh grep nr_dirty /proc/vmstat echo 1 > /proc/sys/fs/inode_debug dd if=/dev/zero of=/var/x bs=1K count=204800& while true; do grep nr_dirty /proc/vmstat; sleep 1; done # bin/test-writeback.sh nr_dirty 19207 nr_dirty 19207 nr_dirty 30924 204800+0 records in 204800+0 records out 209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s nr_dirty 47150 nr_dirty 47141 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47205 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47214 nr_dirty 47215 nr_dirty 47216 nr_dirty 47216 nr_dirty 47216 nr_dirty 47154 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47143 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47142 nr_dirty 47134 nr_dirty 47134 nr_dirty 47135 nr_dirty 47135 nr_dirty 47135 nr_dirty 46097 <== -1038 nr_dirty 46098 nr_dirty 46098 nr_dirty 46098 [...] nr_dirty 46091 nr_dirty 46092 nr_dirty 46092 nr_dirty 45069 <== -1023 nr_dirty 45056 nr_dirty 45056 nr_dirty 45056 [...] nr_dirty 37822 nr_dirty 36799 <== -1023 [...] nr_dirty 36781 nr_dirty 35758 <== -1023 [...] nr_dirty 34708 nr_dirty 33672 <== -1024 [...] nr_dirty 33692 nr_dirty 32669 <== -1023 % ls -li /var/x 847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x % dmesg|grep 847824 # generated by a debug printk [ 529.263184] redirtied inode 847824 line 548 [ 564.250872] redirtied inode 847824 line 548 [ 594.272797] redirtied inode 847824 line 548 [ 629.231330] redirtied inode 847824 line 548 [ 659.224674] redirtied inode 847824 line 548 [ 689.219890] redirtied inode 847824 line 548 [ 724.226655] redirtied inode 847824 line 548 [ 759.198568] redirtied inode 847824 line 548 # line 548 in fs/fs-writeback.c: 543 if (wbc->pages_skipped != pages_skipped) { 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ 548 redirty_tail(inode); 549 } More debug efforts show that __block_write_full_page() never has the chance to call submit_bh() for that big dirty file: the buffer head is *clean*. So basicly no page io is issued by __block_write_full_page(), hence pages_skipped goes up. Also the comment in generic_sync_sb_inodes(): 544 /* 545 * writeback is not making progress due to locked 546 * buffers. Skip this inode for now. 547 */ and the comment in __block_write_full_page(): 1713 /* 1714 * The page was marked dirty, but the buffers were 1715 * clean. Someone wrote them back by hand with 1716 * ll_rw_block/submit_bh. A rare case. 1717 */ do not quite agree with each other. The page writeback should be skipped for 'locked buffer', but here it is 'clean buffer'! This patch fixes this bug. Though I'm not sure why __block_write_full_page() is called only to do nothing and who actually issued the writeback for us. This is the two possible new behaviors after the patch: 1) pretty nice: wait 30s and write ALL:) 2) not so good: - during the dd: ~16M - after 30s: ~4M - after 5s: ~4M - after 5s: ~176M The next patch will fix case (2). Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix ntfs with sb_has_dirty_inodes()Fengguang Wu
NTFS's if-condition on dirty inodes is not complete. Fix it with sb_has_dirty_inodes(). Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock inode lists 8Fengguang Wu
Streamline the management of dirty inode lists and fix time ordering bugs. The writeback logic used to move not-yet-expired dirty inodes from s_dirty to s_io, *only to* move them back. The move-inodes-back-and-forth thing is a mess, which is eliminated by this patch. The new scheme is: - s_dirty acts as a time ordered io delaying queue; - s_io/s_more_io together acts as an io dispatching queue. On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of every full scan of s_io. Otherwise (i.e. for sync/throttle/background writeback), we always pull from s_dirty on each run (a partial scan). Note that the line list_splice_init(&sb->s_more_io, &sb->s_io); is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will sit in s_io for a long time, preventing new expired inodes to get in. Cc: Ken Chen <kenchen@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix periodic superblock dirty inode flushingKen Chen
Current -mm tree has bucketful of bug fixes in periodic writeback path. However, we still hit a glitch where dirty pages on a given inode aren't completely flushed to the disk, and system will accumulate large amount of dirty pages beyond what dirty_expire_interval is designed for. The problem is __sync_single_inode() will move an inode to sb->s_dirty list even when there are more pending dirty pages on that inode. If there is another inode with a small number of dirty pages, we hit a case where the loop iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0. Thus leaving the inode that has large amount of dirty pages behind and it has to wait for another dirty_writeback_interval before we flush it again. We effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval. If the rate of dirtying is sufficiently high, the system will start accumulate a large number of dirty pages. So fix it by having another sb->s_more_io list on which to park the inode while we iterate through sb->s_io and to allow each dirty inode which resides on that sb to have an equal chance of flushing some amount of dirty pages. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists 7Andrew Morton
This one fixes four bugs. There are a few situation in there where writeback decides it is going to skip over a blockdev inode on the kernel-internal blockdev superblock. It presently does this by moving the blockdev inode onto the tail of the blockdev superblock's s_dirty. But a) this screws up s_dirty's reverse-time-orderedness and b) refiling the blockdev for writeback in another 30 second is rude. We should try again sooner than that. Fix all this up by using redirty_head(): move the blockdev inode onto the head of the blockdev superblock's s_dirty list for prompt writeback. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists 6Andrew Morton
Recycling the previous changelog: When the writeback function is operating in writeback-for-flushing mode (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it will skip writing that inode. This is done for throughput and latency: move on to another inode rather than blocking for this one. Writeback skips this inode by moving it off s_io and onto s_dirty, so that writeback can proceed with the other inodes on s_io. However that inode movement can corrupt s_dirty's reverse-time-orderedness. Fix that by using the new redirty_tail(), which will update the refiled inode's dirtied_when field. Note: the behaviour in here is a bit rude: if kupdate happens to come across a locked inode then it will defer writeback of that inode for another 30 seconds. We'll address that in the next patch. Address that here. What we do is to move the skipped inode to the _head_ of s_dirty, immediately eligible for writeout again. Instead of deferring that writeout for another 30 seconds. One would think that this might cause a livelock: we keep on trying to write the same locked inode. But it won't because: a) if that was the case, it would _already_ be happening on the balance_dirty_pages codepath. Because balance_dirty_pages() doesn't care about inode timestamps. b) if we skipped this inode then we won't have done any writeback. The higher-level writeback paths will see that wbc.nr_to_write didn't change and they'll then back off and take a nap. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists 5Andrew Morton
When the writeback function is operating in writeback-for-flushing mode (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it will skip writing that inode. This is done for throughput and latency: move on to another inode rather than blocking for this one. Writeback skips this inode by moving it off s_io and onto s_dirty, so that writeback can proceed with the other inodes on s_io. However that inode movement can corrupt s_dirty's reverse-time-orderedness. Fix that by using the new redirty_tail(), which will update the refiled inode's dirtied_when field. Note: the behaviour in here is a bit rude: if kupdate happens to come across a locked inode then it will defer writeback of that inode for another 30 seconds. We'll address that in the next patch. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix comment, use helper functionAndrew Morton
There's a comment in there which claims that the inode is left on s_io if nfs chickened out of writing some data. But that's not been true for three years. 9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these inodes back onto s_dirty. Fix the comment. In the second leg of the `if', use redirty_tail() rather than open-coding it. Add weaselly comment indicating lack of confidence in the code and lack of the fortitude which would be needed to fiddle with it. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists 4Andrew Morton
When the kupdate function has tried to write back an expired inode it will then check to see whether some of the inode's pages are still dirty. This can happen when the filesystem decided to not write a page for some reason. But it does _not_ occur due to redirtyings: a redirtying will set I_DIRTY_PAGES. What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then put the inode onto the _head_ of s_dirty for consideration on the next kupdate pass, in five seconds time. Problem is, the code failed to modify the inode's timestamp when pushing the inode onto thehead of s_dirty. The patch: If there are no other inodes on s_dirty then we leave the inode's timestamp alone: it is already expired. If there _are_ other inodes on s_dirty then we arrange for this inode to get the same timestamp as the inode which is at the head of s_dirty, thus preserving the s_dirty ordering. But we only need to do this if this inode purports to have been dirtied before the one at head-of-list. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists 3Andrew Morton
While writeback is working against a dirty inode it does a check after trying to write some of the inode's pages: "did the lower layers skip some of the inode's dirty pages because they were locked (or under writeback, or whatever)" If this turns out to be true, we must move the inode back onto s_dirty and redirty it. The reason for doing this is that fsync() and friends only check the s_dirty list, and those functions want to know about those pages which were locked, so they can be waited upon and, if necessary, rewritten. Problem is, that redirtying was putting the inode onto the tail of s_dirty without updating its timestamp. This causes a violation of s_dirty ordering. Fix this by updating inode->dirtied_when when moving the inode onto s_dirty. But the code is still a bit buggy? If the inode was _already_ dirty then we don't need to move it at all. Oh well, hopefully it doesn't matter too much, as that was a redirtying, which was very recent anwyay. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time ordering of the per superblock dirty inode lists: ↵Andrew Morton
memory-backed inodes For reasons which escape me, inodes which are dirty against a ram-backed filesystem are managed in the same way as inodes which are backed by real devices. Probably we could optimise things here. But given that we skip the entire supeblock as son as we hit the first dirty inode, there's not a lot to be gained. And the code does need to handle one particular non-backed superblock: the kernel's fake internal superblock which holds all the blockdevs. Still. At present when the code encounters an inode which is dirty against a memory-backed filesystem it will skip that inode by refiling it back onto s_dirty. But it fails to update the inode's timestamp when doing so which at least makes the debugging code upset. Fix. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17writeback: fix time-ordering of the per-superblock dirty-inode listsAndrew Morton
When writeback has finished writing back an inode it looks to see if that inode is still dirty. If it is, that means that a process redirtied the inode while its writeback was in progress. What we need to do here is to refile the redirtied inode onto the s_dirty list. But we're doing that wrongly: it could be that this inode was redirtied _before_ the last inode on s_dirty. We're blindly appending this inode to the list, after an inode which might be less-recently-dirtied, thus violating the list's ordering. So we must either insertion-sort this inode into the correct place, or we must update this inode's dirtied_when field when appending it to the reverse-sorted s_dirty list, to preserve the reverse-time-ordering. This patch does the latter: if this inode was dirtied less recently than the tail inode then copy the tail inode's timestamp into this inode. This means that in rare circumstances, some inodes will be writen back later than they should have been. But the time slip will be small. Cc: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17reiserfs: do not repair wrong journal paramsEdward Shishkin
When mounting a file system with wrong journal params do not try to repair them, suggest fsck instead. Signed-off-by: Edward Shishkin <edward@namesys.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17ext3: lighten up resize transaction requirementsEric Sandeen
When resizing online, setup_new_group_blocks attempts to reserve a potentially very large transaction, depending on the current filesystem geometry. For some journal sizes, there may not be enough room for this transaction, and the online resize will fail. The patch below resizes & restarts the transaction as necessary while setting up the new group, and should work with even the smallest journal. Tested with something like: [root@newbox ~]# dd if=/dev/zero of=fsfile bs=1024 count=32768 [root@newbox ~]# mkfs.ext3 -b 1024 fsfile 16384 [root@newbox ~]# mount -o loop fsfile mnt/ [root@newbox ~]# resize2fs /dev/loop0 resize2fs 1.40.2 (12-Jul-2007) Filesystem at /dev/loop0 is mounted on /root/mnt; on-line resizing required old desc_blocks = 1, new_desc_blocks = 1 Performing an on-line resize of /dev/loop0 to 32768 (1k) blocks. resize2fs: No space left on device While trying to add group #2 [root@newbox ~]# dmesg | tail -n 1 JBD: resize2fs wants too many credits (258 > 256) [root@newbox ~]# With the below change, it works. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Andreas Dilger <adilger@clusterfs.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17F_DUPFD_CLOEXEC implementationUlrich Drepper
One more small change to extend the availability of creation of file descriptors with FD_CLOEXEC set. Adding a new command to fcntl() requires no new system call and the overall impact on code size if minimal. If this patch gets accepted we will also add this change to the next revision of the POSIX spec. To test the patch, use the following little program. Adjust the value of F_DUPFD_CLOEXEC appropriately. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #ifndef F_DUPFD_CLOEXEC # define F_DUPFD_CLOEXEC 12 #endif int main (int argc, char *argv[]) { if (argc > 1) { if (fcntl (3, F_GETFD) == 0) { puts ("descriptor not closed"); exit (1); } if (errno != EBADF) { puts ("error not EBADF"); exit (1); } exit (0); } int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0); if (fd == -1 && errno == EINVAL) { puts ("F_DUPFD_CLOEXEC not supported"); return 0; } if (fd != 3) { puts ("program called with descriptors other than 0,1,2"); return 1; } execl ("/proc/self/exe", "/proc/self/exe", "1", NULL); puts ("execl failed"); return 1; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17spin_lock_unlocked cleanupsRoel Kluin
Replace some SPIN_LOCK_UNLOCKED with DEFINE_SPINLOCK Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Break ELF_PLATFORM and stack pointer randomization dependencyFranck Bui-Huu
Currently arch_align_stack() is used by fs/binfmt_elf.c to randomize stack pointer inside a page. But this happens only if ELF_PLATFORM symbol is defined. ELF_PLATFORM is normally set if the architecture wants ld.so to load implementation specific libraries for optimization. And currently a lot of architectures just yield this symbol to NULL. This is the case for MIPS architecture where ELF_PLATFORM is NULL but arch_align_stack() has been redefined to do stack inside page randomization. So in this case no randomization is actually done. This patch breaks this dependency which seems to be useless and allows platforms such MIPS to do the randomization. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17rename signalfd_siginfo fieldsDavide Libenzi
For Michael Kerrisk request, the following patch renames signalfd_siginfo fields in order to keep them consistent with the siginfo_t ones. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17ext3: remove #ifdef CONFIG_EXT3_INDEXEric Sandeen
CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is unconditionally defined in ext3_fs.h. tune2fs is already able to turn off dir indexing, so at this point it's just cluttering up the code. Remove it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fs: correct SuS compliance for open of large file without optionsAlan Cox
The early LFS work that Linux uses favours EFBIG in various places. SuSv3 specifically uses EOVERFLOW for this as noted by Michael (Bug 7253) [EOVERFLOW] The named file is a regular file and the size of the file cannot be represented correctly in an object of type off_t. We should therefore transition to the proper error return code Signed-off-by: Alan Cox <alan@redhat.com> Cc: Theodore Tso <tytso@mit.edu> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17anon-inodes use open coded atomic_inc for the shared inodeDavide Libenzi
Since we know the shared inode count is always >0, we can avoid igrab() and use an open coded atomic_inc(). This also fixes a bug noticed by Yan Zheng <yanzheng@21cn.com>: were checking for an IS_ERR() return from igrab(), but it actually returns NULL on error. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Yan Zheng <yanzheng@21cn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Don't truncate /proc/PID/environ at 4096 charactersJames Pearson
/proc/PID/environ currently truncates at 4096 characters, patch based on the /proc/PID/mem code. Signed-off-by: James Pearson <james-p@moving-picture.com> Cc: Anton Arapov <aarapov@redhat.com> Cc: Jan Engelhardt <jengelh@computergmbh.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fs/udf/balloc.c: mark a variable as uninitialized_var()WANG Cong
Kill a may-be-used-uninitialized warning. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17menuconfig: transform Network Filesystems menuJan Engelhardt
Turn Network File Systems into a menuconfig so that it can be disabled at once. (Note: I added a "default y". If you do not like that, speak up.) Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Steven French <sfrench@us.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Eric Van Hensbergen <ericvh@hera.kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17menuconfig: transform NLS and DLM menusJan Engelhardt
Changes NLS and DLM menus into a 'menuconfig' object so that it can be disabled at once without having to enter the menu first to disable the config option. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17change inotifyfs magic as the same magic is used for futexfsAndrey Mirkin
Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a little bit confusing. Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as magic for inotifyfs. Signed-off-by: Andrey Mirkin <major@openvz.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17increase AT_VECTOR_SIZE to terminate saved_auxv properlyOlaf Hering
include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO. fs/binfmt_elf.c has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT entries. So in the worst case, saved_auxv does not get an AT_NULL entry at the end. The saved_auxv array must be terminated with an AT_NULL entry. Make the size of mm_struct->saved_auxv arch dependend, based on the number of ARCH_DLINFO entries. Signed-off-by: Olaf Hering <olh@suse.de> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17vfs: use the predefined d_unhashed inline function insteadDenis Cheng
Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17UDF: coding style fixupsCyrill Gorcunov
This patch does additional coding style fixup. Initially the code is being distorted by Lindent (in my patches sent not very long ago) and fixed in the followup patches but this stuff was accidently missed. New and old compiled files were compared with cmp to check for being identically. So the patch will not break the kernel. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fs/isofs/namei.c: Remove uninitialized local vars warningBorislav Petkov
shut up those: fs/isofs/namei.c: In function 'isofs_lookup': fs/isofs/namei.c:161: warning: 'offset' may be used uninitialized in this function fs/isofs/namei.c:161: warning: 'block' may be used uninitialized in this function By the way, they get overwritten at the end of isofs_find_entry(). Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>