summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2007-07-17Freezer: make kernel threads nonfreezable by defaultRafael J. Wysocki
Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17fs: introduce some page/buffer invariantsNick Piggin
It is a bug to set a page dirty if it is not uptodate unless it has buffers. If the page has buffers, then the page may be dirty (some buffers dirty) but not uptodate (some buffers not uptodate). The exception to this rule is if the set_page_dirty caller is racing with truncate or invalidate. A buffer can not be set dirty if it is not uptodate. If either of these situations occurs, it indicates there could be some data loss problem. Some of these warnings could be a harmless one where the page or buffer is set uptodate immediately after it is dirtied, however we should fix those up, and enforce this ordering. Bring the order of operations for truncate into line with those of invalidate. This will prevent a page from being able to go !uptodate while we're holding the tree_lock, which is probably a good thing anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17mm: clean up and kernelify shrinker registrationRusty Russell
I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Lumpy Reclaim V4Andy Whitcroft
When we are out of memory of a suitable size we enter reclaim. The current reclaim algorithm targets pages in LRU order, which is great for fairness at order-0 but highly unsuitable if you desire pages at higher orders. To get pages of higher order we must shoot down a very high proportion of memory; >95% in a lot of cases. This patch set adds a lumpy reclaim algorithm to the allocator. It targets groups of pages at the specified order anchored at the end of the active and inactive lists. This encourages groups of pages at the requested orders to move from active to inactive, and active to free lists. This behaviour is only triggered out of direct reclaim when higher order pages have been requested. This patch set is particularly effective when utilised with an anti-fragmentation scheme which groups pages of similar reclaimability together. This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms the foundation. Credit to Mel Gorman for sanitity checking. Mel said: The patches have an application with hugepage pool resizing. When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can be resized with greater reliability. Testing on a desktop machine with 2GB of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own was very slow as the success rate was quite low. Without lumpy-reclaim, each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical. [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup] [bunk@stusta.de: static declarations for internal functions] [a.p.zijlstra@chello.nl: initial lumpy V2 implementation] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Add __GFP_MOVABLE for callers to flag allocations from high memory that may ↵Mel Gorman
be migrated It is often known at allocation time whether a page may be migrated or not. This patch adds a flag called __GFP_MOVABLE and a new mask called GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. An API function very similar to alloc_zeroed_user_highpage() is added for __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The flags used by alloc_zeroed_user_highpage() are not changed because it would change the semantics of an existing API. After this patch is applied there are no in-kernel users of alloc_zeroed_user_highpage() so it probably should be marked deprecated if this patch is merged. Note that this patch includes a minor cleanup to the use of __GFP_ZERO in shmem.c to keep all flag modifications to inode->mapping in the shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of Hugh Dickens. Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the concept. Credit to Hugh Dickens for catching issues with shmem swap vector and ramfs allocations. [akpm@linux-foundation.org: build fix] [hugh@veritas.com: __GFP_ZERO cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16utime(s): Honour CAP_FOWNER when times==NULLSatyam Sharma
do_utimes() does not honour CAP_FOWNER when times==NULL. Trivial and obvious one-line fix. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Fix LDM for new field in the VOL5 VBLK.Anton Altaparmakov
Teach LDM about a new field encountered with Windows Vista. This fixes LDM for people using Vista who have disabled drive letter assignment from one or more volumes. Doing this introduces a so far unknown field in the LDM database in the VOL5 VBLK structure which causes the LDM driver to fail to parse the VBLK structure and hence LDM fails to parse the disk altogether. This patch teaches the driver about this field. Thanks got to Ashton Mills <amills@iinet.com.au> for reporting the problem and working with me on getting it fixed. It is now working for him. Signed-off-by: Anton Altaparmakov <aia21@cantab.net> CC: Richard Russon <ldm@flatcap.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: [PATCH] sched: fix up fs/proc/array.c whitespace problems [PATCH] sched: prettify prio_to_wmult[] [PATCH] sched: document prio_to_wmult[] [PATCH] sched: improve weight-array comments [PATCH] sched: remove dead code from task_stime() Fixed up trivial conflict in fs/proc/array.c
2007-07-16Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits) [PATCH] ocfs2: zero_user_page conversion ocfs2: Support xfs style space reservation ioctls ocfs2: support for removing file regions ocfs2: update truncate handling of partial clusters ocfs2: btree support for removal of arbirtrary extents ocfs2: Support creation of unwritten extents ocfs2: support writing of unwritten extents ocfs2: small cleanup of ocfs2_write_begin_nolock() ocfs2: btree changes for unwritten extents ocfs2: abstract btree growing calls ocfs2: use all extent block suballocators ocfs2: plug truncate into cached dealloc routines ocfs2: simplify deallocation locking ocfs2: harden buffer check during mapping of page blocks ocfs2: shared writeable mmap ocfs2: factor out write aops into nolock variants ocfs2: rework ocfs2_buffered_write_cluster() ocfs2: take ip_alloc_sem during entire truncate ocfs2: Add "preferred slot" mount option [KJ PATCH] Replacing memset(<addr>,0,PAGE_SIZE) with clear_page() in fs/ocfs2/dlm/dlmrecovery.c ...
2007-07-16Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: splice: direct splicing updates ppos twice more ACSI removal umem: Fix match of pci_ids in umem driver umem: Remove references to dead CONFIG_MM_MAP_MEMORY variable remove the documentation for the legacy CDROM drivers
2007-07-16Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6Linus Torvalds
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: (68 commits) sh: sh-rtc support for SH7709. sh: Revert __xdiv64_32 size change. sh: Update r7785rp defconfig. sh: Export div symbols for GCC 4.2 and ST GCC. sh: fix race in parallel out-of-tree build sh: Kill off dead mach.c for hp6xx. sh: hd64461.h cleanup and added comments. sh: Update the alignment when 4K stacks are used. sh: Add a .bss.page_aligned section for 4K stacks. sh: Don't let SH-4A clobber SH-4 CFLAGS. sh: Add parport stub for SuperIO ports. sh: Drop -Wa,-dsp for DSP tuning. sh: Update dreamcast defconfig. fb: pvr2fb: A few more __devinit annotations for PCI. fb: pvr2fb: Fix up section mismatch warnings. sh: Select IPR-IRQ for SH7091. sh: Correct __xdiv64_32/div64_32 return value size. sh: Fix timer-tmu build for SH-3. sh: Add cpu and mach links to CLEAN_FILES. sh: Preliminary support for the SH-X3 CPU. ...
2007-07-16fat: Fix the race of read/write the FAT12 entryOGAWA Hirofumi
FAT12 entry is 12bits, so it needs 2 phase to update the value. And writer and reader access it without any lock, so reader can get the half updated value. This fixes the long standing race condition by adding a global spinlock to only FAT12 for avoiding any impact against FAT16/32. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16compat32: ignore the LOOP_CLR_FD ioctlGeert Uytterhoeven
compat32: Ignore the LOOP_CLR_FD ioctl for the loop block device, to kill an annoying kernel message when e.g. busybox umount is used. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16NLS: Remove obsolete Makefile entriesRobert P. J. Day
Since the corresponding source files no longer exist, remove the irrelevant Makefile entries for them. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext4: statfs speed upBadari Pulavarty
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3: statfs speed upBadari Pulavarty
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext2: statfs speed upBadari Pulavarty
This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Fix trivial typos in anon_inodes.c commentsJ. Bruce Fields
Trivial typo and grammar fixes. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext4: fix error handling in ext4_create_journalBorislav Petkov
Fix error handling in ext4_create_journal according to kernel conventions. Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3: fix error handling in ext3_create_journal()Borislav Petkov
Fix error handling in ext3_create_journal according to kernel conventions. Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16UDF: fix function name from udf_crc16 to udf_crcCyrill Gorcunov
We have to change udf_crc16() name to udf_crc() to be able to play with CRC test. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16is_power_of_2: ufs/super.cvignesh babu
Replace (n & (n-1)) with is_power_of_2 Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Acked-by: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16move seccomp from /proc to a prctlAndrea Arcangeli
This reduces the memory footprint and it enforces that only the current task can enable seccomp on itself (this is a requirement for a strightforward [modulo preempt ;) ] TIF_NOTSC implementation). Signed-off-by: Andrea Arcangeli <andrea@cpushare.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16bd_claim_by_disk: fix warningAndrew Morton
Fix this: fs/block_dev.c: In function 'bd_claim_by_disk': fs/block_dev.c:970: warning: 'found' may be used uninitialized in this function and given that free_bd_holder() now needs free(NULL)-is-legal behaviour, we can simplify bd_release_from_kobject(). Cc: Bjorn Steinbrink <B.Steinbrink@gmx.de> Cc: Johannes Weiner <hannes-kernel@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Replace obscure constructs in fs/block_dev.cJohannes Weiner
Replace some funky codepaths in fs/block_dev.c with cleaner versions of the affected places. [akpm@linux-foundation.org: fix return value] Signed-off-by: Johannes Weiner <hannes-kernel@saeurebad.de> Cc: Bjorn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16fs/namespace.c should #include "internal.h"Adrian Bunk
Every file should include the headers containing the prototypes for its global functions. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16HFS+: add custom dentry hash and comparison operationsDuane Griffin
Add custom dentry hash and comparison operations for HFS+ filesystems that are case-insensitive and/or do automatic unicode decomposition. The new operations reuse the existing HFS+ ASCII to unicode conversion, unicode decomposition and case folding functionality. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16HFS+: refactor ASCII to unicode conversion routine for later reuseDuane Griffin
The HFS+ filesystem is case-insensitive and does automatic unicode decomposition by default, but does not provide custom dentry operations. This can lead to multiple dentries being cached for lookups on a filename with varying case and/or character (de)composition. These patches add custom dentry hash and comparison operations for case-sensitive and/or automatically decomposing HFS+ filesystems. Unicode decomposition and case-folding are performed as required to ensure equivalent filenames are hashed to the same values and compare as equal. This patch: Refactor existing HFS+ ASCII to unicode string conversion routine to split out character conversion functionality. This will be reused by the custom dentry hash and comparison routines. This approach avoids unnecessary memory allocation compared to using the string conversion routine directly in the new functions. [akpm@linux-foundation.org: avoid use-of-uninitialised] Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16mistaken ext4_inode_bitmap for ext4_block_bitmapToshiyuki Okajima
In ext4_new_blocks(), one of two ext4_block_bitmap() calls should be ext4_inode_bitmap() call. It is not harmful in normal processing, but it should be fixed. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16is_power_of_2(): jbdvignesh babu
Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2(). Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16is_power_of_2: ext3/super.cvignesh babu
Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2() Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16drop obsolete sys_ioctl exportChristoph Hellwig
sys_ioctl() was only exported for our first version of compat ioctl handling. Now that the whole compat ioctl handling mess is more or less sorted out there are no more modular users left and we can kill it. There's one exception and that's sparc64's solaris compat module, but sparc64 has it's own export predating the generic one by years for that which this patch leaves untouched. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16namespace: ensure clone_flags are always stored in an unsigned longEric W. Biederman
While working on unshare support for the network namespace I noticed we were putting clone flags in an int. Which is weird because the syscall uses unsigned long and we at least need an unsigned to properly hold all of the unshare flags. So to make the code consistent, this patch updates the code to use unsigned long instead of int for the clone flags in those places where we get it wrong today. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3: remove extra IS_RDONLY() checkDave Hansen
ext3_change_inode_journal_flag() is only called from one location: ext3_ioctl(EXT3_IOC_SETFLAGS). That ioctl case already has a IS_RDONLY() call in it so this one is superfluous. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16diskquota: 32bit quota tools on 64bit architecturesVasily Tarasov
OpenVZ Linux kernel team has discovered the problem with 32bit quota tools working on 64bit architectures. In 2.6.10 kernel sys32_quotactl() function was replaced by sys_quotactl() with the comment "sys_quotactl seems to be 32/64bit clean, enable it for 32bit" However this isn't right. Look at if_dqblk structure: struct if_dqblk { __u64 dqb_bhardlimit; __u64 dqb_bsoftlimit; __u64 dqb_curspace; __u64 dqb_ihardlimit; __u64 dqb_isoftlimit; __u64 dqb_curinodes; __u64 dqb_btime; __u64 dqb_itime; __u32 dqb_valid; }; For 32 bit quota tools sizeof(if_dqblk) == 0x44. But for 64 bit kernel its size is 0x48, 'cause of alignment! Thus we got a problem. Attached patch reintroduce sys32_quotactl() function, that handles this and related situations. [michal.k.k.piotrowski@gmail.com: build fix] [akpm@linux-foundation.org: Make it link with CONFIG_QUOTA=n] Signed-off-by: Vasily Tarasov <vtaras@openvz.org> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext4: fix deadlock in ext4_remount() and orphan list handlingJan Kara
ext4_orphan_add() and ext4_orphan_del() functions lock sb->s_lock with a transaction started with ext4_mark_recovery_complete() waits for a transaction holding sb->s_lock, thus leading to a possible deadlock. At the moment we call ext4_mark_recovery_complete() from ext4_remount() we have done all the work needed for remounting and thus we are safe to drop sb->s_lock before we wait for transactions to commit. Note that at this moment we are still guarded by s_umount lock against other remounts/umounts. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Sandeen <sandeen@sandeen.net> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3: fix deadlock in ext3_remount() and orphan list handlingJan Kara
ext3_orphan_add() and ext3_orphan_del() functions lock sb->s_lock with a transaction started with ext3_mark_recovery_complete() waits for a transaction holding sb->s_lock, thus leading to a possible deadlock. At the moment we call ext3_mark_recovery_complete() from ext3_remount() we have done all the work needed for remounting and thus we are safe to drop sb->s_lock before we wait for transactions to commit. Note that at this moment we are still guarded by s_umount lock against other remounts/umounts. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Sandeen <sandeen@sandeen.net> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16fix create_new_namespaces() return valueCedric Le Goater
dup_mnt_ns() and clone_uts_ns() return NULL on failure. This is wrong, create_new_namespaces() uses ERR_PTR() to catch an error. This means that the subsequent create_new_namespaces() will hit BUG_ON() in copy_mnt_ns() or copy_utsname(). Modify create_new_namespaces() to also use the errors returned by the copy_*_ns routines and not to systematically return ENOMEM. [oleg@tv-sign.ru: better changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16binfmt_elf warning fixAndrew Morton
fs/binfmt_elf.c: In function 'load_elf_binary': fs/binfmt_elf.c:1002: warning: 'interp_map_addr' may be used uninitialized in this function The compiler (gcc-4.1.0) is correct, but it failed to notice that we didn't use the resulting value. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16revert "vanishing ioctl handler debugging"Andrew Morton
Revert my do_ioctl() debugging patch: Paul fixed the bug. Cc: Paul Fulghum <paulkf@microgate.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16hugetlbfs: handle empty options stringLee Schermerhorn
I was seeing a null pointer deref in fs/super.c:vfs_kern_mount(). Some file system get_sb() handler was returning NULL mnt_sb with a non-negative return value. I also noticed a "hugetlbfs: Bad mount option:" message in the log. Turns out that hugetlbfs_parse_options() was not checking for an empty option string after call to strsep(). On failure, hugetlbfs_parse_options() returns 1. hugetlbfs_fill_super() just passed this return code back up the call stack where vfs_kern_mount() missed the error and proceeded with a NULL mnt_sb. Apparently introduced by patch: hugetlbfs-use-lib-parser-fix-docs.patch The problem was exposed by this line in my fstab: none /huge hugetlbfs defaults 0 0 It can also be demonstrated by invoking mount of hugetlbfs directly with no options or a bogus option. This patch: 1) adds the check for empty option to hugetlbfs_parse_options(), 2) enhances the error message to bracket any unrecognized option with quotes , 3) modifies hugetlbfs_parse_options() to return -EINVAL on any unrecognized option, 4) adds a BUG_ON() to vfs_kern_mount() to catch any get_sb() handler that returns a NULL mnt->mnt_sb with a return value >= 0. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16hugetlbfs: use lib/parser, fix docsRandy Dunlap
Use lib/parser.c to parse hugetlbfs mount options. Correct docs in hugetlbpage.txt. old size of hugetlbfs_fill_super: 675 bytes new size of hugetlbfs_fill_super: 686 bytes (hugetlbfs_parse_options() is inlined) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Adam Litke <agl@us.ibm.com> Acked-by: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16HFSPlus: change kmalloc/memset to kzallocWyatt Banks
Removed kmalloc and memset in favor of kzalloc. To explain the HFSPLUS_SB() macro in the removed memset call: hfsplus_fs.h:#define HFSPLUS_SB(super) (*(struct hfsplus_sb_info *)(super)->s_fs_info) Signed-off-by: Wyatt Banks <wyatt@banksresearch.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16taskstats: add context-switch countersMaxim Uvarov
Make available to the user the following task and process performance statistics: * Involuntary Context Switches (task_struct->nivcsw) * Voluntary Context Switches (task_struct->nvcsw) Statistics information is available from: 1. taskstats interface (Documentation/accounting/) 2. /proc/PID/status (task only). This data is useful for detecting hyperactivity patterns between processes. [akpm@linux-foundation.org: cleanup] Signed-off-by: Maxim Uvarov <muvarov@ru.mvista.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3/ext4: orphan list corruption due bad inodeVasily Averin
After ext3 orphan list check has been added into ext3_destroy_inode() (please see my previous patch) the following situation has been detected: EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0 Inode 00000101a15b7840: orphan list check failed! 00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f ... Call Trace: [<ffffffff80211ea9>] ext3_destroy_inode+0x79/0x90 [<ffffffff801a2b16>] sys_unlink+0x126/0x1a0 [<ffffffff80111479>] error_exit+0x0/0x81 [<ffffffff80110aba>] system_call+0x7e/0x83 First messages said that unlinked inode has i_nlink=0, then ext3_unlink() adds this inode into orphan list. Second message means that this inode has not been removed from orphan list. Inode dump has showed that i_fop = &bad_file_ops and it can be set in make_bad_inode() only. Then I've found that ext3_read_inode() can call make_bad_inode() without any error/warning messages, for example in the following case: ... if (inode->i_nlink == 0) { if (inode->i_mode == 0 || !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) { /* this inode is deleted */ brelse (bh); goto bad_inode; ... Bad inode can live some time, ext3_unlink can add it to orphan list, but ext3_delete_inode() do not deleted this inode from orphan list. As result we can have orphan list corruption detected in ext3_destroy_inode(). However it is not clear for me how to fix this issue correctly. As far as i see is_bad_inode() is called after iget() in all places excluding ext3_lookup() and ext3_get_parent(). I believe it makes sense to add bad inode check to these functions too and call iput if bad inode detected. Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16ext3/ext4: orphan list check on destroy_inodeVasily Averin
Customers claims to ext3-related errors, investigation showed that ext3 orphan list has been corrupted and have the reference to non-ext3 inode. The following debug helps to understand the reasons of this issue. [akpm@linux-foundation.org: update for print_hex_dump() changes] Signed-off-by: Vasily Averin <vvs@sw.ru> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Remove capability.h from mm.hAlexey Dobriyan
I forgot to remove capability.h from mm.h while removing sched.h! This patch remedies that, because the only inline function which was using CAP_something was made out of line. Cross-compile tested without regressions on: all powerpc defconfigs all mips defconfigs all m68k defconfigs all arm defconfigs all ia64 defconfigs alpha alpha-allnoconfig alpha-defconfig alpha-up arm i386 i386-allnoconfig i386-defconfig i386-up ia64 ia64-allnoconfig ia64-defconfig ia64-up m68k mips parisc parisc-allnoconfig parisc-defconfig parisc-up powerpc powerpc-up s390 s390-allnoconfig s390-defconfig s390-up sparc sparc-allnoconfig sparc-defconfig sparc-up sparc64 sparc64-allnoconfig sparc64-defconfig sparc64-up um-x86_64 x86_64 x86_64-allnoconfig x86_64-defconfig x86_64-up as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16seq_file: more atomicity in traverse()Alexey Dobriyan
Original problem: in some circumstances seq_file interface can present infinite proc file to the following script when normally said proc file is finite: while read line; do [do something with $line] done </proc/$FILE bash, to implement such loop does essentially read(0, buf, 128); [find \n] lseek(0, -difference, SEEK_CUR); Consider, proc file prints list of objects each of them consists of many lines, each line is shorter than 128 bytes. Two objects in list, with ->index'es being 0 and 1. Current one is 1, as bash prints second object line by line. Imagine first object being removed right before lseek(). traverse() will be called, because there is negative offset. traverse() will reset ->index to 0 (!). traverse() will call ->next() and get NULL in any usual iterate-over-list code using list_for_each_entry_continue() and such. There is one object in list now after all... traverse() will return 0, lseek() will update file position and pretend everything is OK. So, what we have now: ->f_pos points to place where second object will be printed, but ->index is 0. seq_read() instead of returning EOF, will start printing first line of first object every time it's called, until enough objects are added to ->f_pos return in bounds. Fix is to update ->index only after we're sure we saw enough objects down the road. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16O_CLOEXEC for SCM_RIGHTSUlrich Drepper
Part two in the O_CLOEXEC saga: adding support for file descriptors received through Unix domain sockets. The patch is once again pretty minimal, it introduces a new flag for recvmsg and passes it just like the existing MSG_CMSG_COMPAT flag. I think this bit is not used otherwise but the networking people will know better. This new flag is not recognized by recvfrom and recv. These functions cannot be used for that purpose and the asymmetry this introduces is not worse than the already existing MSG_CMSG_COMPAT situations. The patch must be applied on the patch which introduced O_CLOEXEC. It has to remove static from the new get_unused_fd_flags function but since scm.c cannot live in a module the function still hasn't to be exported. Here's a test program to make sure the code works. It's so much longer than the actual patch... #include <errno.h> #include <error.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/socket.h> #include <sys/un.h> #ifndef O_CLOEXEC # define O_CLOEXEC 02000000 #endif #ifndef MSG_CMSG_CLOEXEC # define MSG_CMSG_CLOEXEC 0x40000000 #endif int main (int argc, char *argv[]) { if (argc > 1) { int fd = atol (argv[1]); printf ("child: fd = %d\n", fd); if (fcntl (fd, F_GETFD) == 0 || errno != EBADF) { puts ("file descriptor valid in child"); return 1; } return 0; } struct sockaddr_un sun; strcpy (sun.sun_path, "./testsocket"); sun.sun_family = AF_UNIX; char databuf[] = "hello"; struct iovec iov[1]; iov[0].iov_base = databuf; iov[0].iov_len = sizeof (databuf); union { struct cmsghdr hdr; char bytes[CMSG_SPACE (sizeof (int))]; } buf; struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1, .msg_control = buf.bytes, .msg_controllen = sizeof (buf) }; struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg); cmsg->cmsg_level = SOL_SOCKET; cmsg->cmsg_type = SCM_RIGHTS; cmsg->cmsg_len = CMSG_LEN (sizeof (int)); msg.msg_controllen = cmsg->cmsg_len; pid_t child = fork (); if (child == -1) error (1, errno, "fork"); if (child == 0) { int sock = socket (PF_UNIX, SOCK_STREAM, 0); if (sock < 0) error (1, errno, "socket"); if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0) error (1, errno, "bind"); if (listen (sock, SOMAXCONN) < 0) error (1, errno, "listen"); int conn = accept (sock, NULL, NULL); if (conn == -1) error (1, errno, "accept"); *(int *) CMSG_DATA (cmsg) = sock; if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0) error (1, errno, "sendmsg"); return 0; } /* For a test suite this should be more robust like a barrier in shared memory. */ sleep (1); int sock = socket (PF_UNIX, SOCK_STREAM, 0); if (sock < 0) error (1, errno, "socket"); if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0) error (1, errno, "connect"); unlink (sun.sun_path); *(int *) CMSG_DATA (cmsg) = -1; if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0) error (1, errno, "recvmsg"); int fd = *(int *) CMSG_DATA (cmsg); if (fd == -1) error (1, 0, "no descriptor received"); char fdname[20]; snprintf (fdname, sizeof (fdname), "%d", fd); execl ("/proc/self/exe", argv[0], fdname, NULL); puts ("execl failed"); return 1; } [akpm@linux-foundation.org: Fix fastcall inconsistency noted by Michael Buesch] [akpm@linux-foundation.org: build fix] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Buesch <mb@bu3sch.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Introduce O_CLOEXECUlrich Drepper
The problem is as follows: in multi-threaded code (or more correctly: all code using clone() with CLONE_FILES) we have a race when exec'ing. thread #1 thread #2 fd=open() fork + exec fcntl(fd,F_SETFD,FD_CLOEXEC) In some applications this can happen frequently. Take a web browser. One thread opens a file and another thread starts, say, an external PDF viewer. The result can even be a security issue if that open file descriptor refers to a sensitive file and the external program can somehow be tricked into using that descriptor. Just adding O_CLOEXEC support to open() doesn't solve the whole set of problems. There are other ways to create file descriptors (socket, epoll_create, Unix domain socket transfer, etc). These can and should be addressed separately though. open() is such an easy case that it makes not much sense putting the fix off. The test program: #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <unistd.h> #ifndef O_CLOEXEC # define O_CLOEXEC 02000000 #endif int main (int argc, char *argv[]) { int fd; if (argc > 1) { fd = atol (argv[1]); printf ("child: fd = %d\n", fd); if (fcntl (fd, F_GETFD) == 0 || errno != EBADF) { puts ("file descriptor valid in child"); return 1; } return 0; } fd = open ("/proc/self/exe", O_RDONLY | O_CLOEXEC); printf ("in parent: new fd = %d\n", fd); char buf[20]; snprintf (buf, sizeof (buf), "%d", fd); execl ("/proc/self/exe", argv[0], buf, NULL); puts ("execl failed"); return 1; } [kyle@parisc-linux.org: parisc fix] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>